Indiana Men’s Basketball Coach Analysis

Report Summary

Summary of the Problem Statement

The Indiana University Athletics Department previously commissioned a study to evaluate the performance of the men’s basketball program. Researchers raised concerns about the gap between the program’s historical success and its performance in recent years. Given the program’s successful history, the limited number of NCAA tournament appearances and the lack of conference championships are reason enough to question the direction of the program. The athletic department therefore feels strongly that the program needs to improve quickly. To that end, the athletic director for Indiana University has recommended a thorough, data-driven evaluation of the team’s performance over the past four seasons in relation to the coaching change. The athletic director wants to know whether performance has improved and whether the new coach is the right fit for the program. This evaluation covers the current head coach, Mike Woodson, as well as the previous head coach, Archie Miller. Woodson took over in March of 2021 from Miller, who had been head coach since March of 2017. While Archie Miller’s tenure spans the 2017-18 through 2020-21 seasons, this study includes only Miller’s last two seasons as head coach. Given this, the aim of this study is twofold:

Serve as objective feedback of the performance of the team to evaluate the effectiveness of the new coach in comparison to himself (in the year prior) as well as the coach before him.
Create a general recruiting profile for the Hoosiers and see if there have been any underlying changes between coaches for potential prospects.

Summary of the Data and Methodology

The research was based on public NCAA Hoop data (supplied by the NCAA Hoop Repository) spanning the 2019-20 season through the most recent 2022-23 season. The roster and schedule data were provided by year. The yearly rosters were combined into one large data file containing all players on the roster over the four seasons being analyzed, and the yearly schedules were combined in the same way to create a list of all games scheduled over that four-season period. The box score data, however, was provided by game. Each game therefore had to be imported and combined within its season, and the seasons were then combined to create the box scores for all games over the past four seasons. Aside from needing to be gathered in one place, the data was well structured. Thus, data cleaning was not extensive, and the larger focus was on data manipulation throughout the study. Each manipulation is accompanied by a description and explanation so that there is a general understanding of what took place.

This study relied largely on descriptive analytics (data consolidation, bar graphs, tables, and line graphs) to reach many of the conclusions discussed below. Predictive analytics was also used, though to a limited extent, to identify the variables most important to game outcome, since wins are one metric used to define overall success. Accordingly, the analyst aimed to find the key components of the game (using logistic regression, classification trees, and random forests) that had the most influence on game outcome. The analyst then compared the identified components across the two coaches and by year to see whether improvement had taken place.

Athletic Director Implications

To conclude, the analysis indicates that the Indiana University Men’s Basketball team has improved considerably since the coaching change, and has continued to improve over the past two seasons with Mike Woodson as head coach. The improvement is easily seen in the number of wins and in the winning percentages posted over the entire season as well as within conference competition. Wins are one mark of success in sports, but notable improvement can also be seen in several areas of play for the Hoosiers, both between coaches and across successive seasons. Based on this evaluation, the recommendation is to ask Woodson to return next season as head coach. The evidence suggests that he could lead the program to more success next season.

The analysis also indicates that the recruiting styles of the two coaches are nearly identical. Both coaches held the roster to the same size. There also appears to be a larger number of opportunities available for guards and forwards with respective heights of at least 6’4” and 6’6”. In addition, neither coach appears to restrict recruiting to any particular part of the country. Thus, any DI-caliber athlete should have an opportunity to play in Bloomington.

Business Background

General Information and History

The Indiana University Athletics Department is looking to discover new information on the development of program performance for its men’s basketball team. The Hoosiers are an NCAA Division I college basketball program out of Bloomington, Indiana, and a member of the Big Ten Conference. Notably, the men’s basketball program is among the most successful programs in NCAA history. The Hoosiers have won a total of five NCAA Championships, sitting behind only UCLA (11), Kentucky (8), and North Carolina (6) (Duke is also notable, tied at 5). Among these five championship teams is the last undefeated national champion (in 1976). The historical success of the program does not end there, as the Hoosiers also finished as NCAA tournament runners-up in 2002. Aside from national titles, the program has amassed 22 Big Ten Conference Championships and 40 NCAA tournament appearances. The Hoosiers have also remained a relevant, high-performing basketball program, with 31 preseason appearances in the AP Poll and 28 appearances in the final AP Poll. These ranked appearances add up to an estimated 574 total weeks spent in the AP Poll. Although Indiana has a history as a successful program, the team has struggled in recent seasons, and there are overarching concerns that the program has taken steps back since its last national title game appearance in 2002. Thus, with the hiring of new head coach Mike Woodson (as of March 28, 2021), the Indiana University athletic director has hired a sports analysis team to evaluate the team’s performance under the new coach and ensure this coaching change is leading the program in the right direction.

Indiana Athletics Mission Statement

“The spirit of Indiana in athletics must be the spirit of the team. The team must be competitive in spirit and have the will to win over and above the will to star… Without the spirit of the team and this goal of school above self, we fail miserably—not only here in our sports life, but in the world of business and society after we leave this campus… With it, we exemplify the true spirit of Indiana athletics.”

Business Problem

In any sport, the coaching staff is largely held responsible for the program’s performance outcomes. As stated above, Indiana basketball has a strong history of successful teams. However, the team’s performance in recent years has not lived up to the high standards forged in the past. The team has had notable accomplishments in the last 5 years, but the low points and mediocrity seem to outweigh most of the positives. Thus, with the increased interest in and use of analytics in sports, the athletic director has added staff to evaluate the overall performance of the team under the two different head coaches (Archie Miller and Mike Woodson) within the past four seasons. The additional staff members will look for patterns in play over the course of the seasons (in relation to each coach) to see whether the program is improving as expected. One of the two coaches being examined is Archie Miller, who was named the 29th head coach of the Indiana Men’s Basketball program on March 25, 2017. He served as head coach for the Hoosiers until March 28, 2021, when the head coach in question, Mike Woodson, took over leading the Hoosiers. This analysis will look at the performance patterns at the tail end of Miller’s tenure and compare them to the performance patterns at the beginning of Woodson’s tenure at Indiana. Because the Hoosiers have struggled so much recently, the athletic department does not want to waste any time and needs to see the program improve. This analysis is taking place so that coaching changes, if needed, can occur as quickly as possible.

All of the data for this project will be sourced from a public database known as the NCAA Hoop Dataset.

Ideally, results from this study will:

Serve as objective feedback on the team to evaluate the effectiveness of the new coach. The analyst hopes to see improvement in overall wins as well as within key components of the game.
Create a general recruiting profile for the Hoosiers and see if there have been any underlying changes between coaches for potential prospects.

By looking at performance, the athletic director can make a well-informed decision on whether the new head coach appears to be the right fit for this program. Additionally, this analysis will allow the coach to take active and controllable steps toward improving his team by learning about its tendencies and shortcomings.

Key Stakeholders

With all of the business background components in mind, a list of key stakeholders has been generated below:

Student athletes and coaches (Primary)
The athletic department (Secondary)
The university/admissions office (Tertiary)
Students (Tertiary)
Staff and faculty (Tertiary)
City of Bloomington (Tertiary)
NBA (Quaternary)

Note that the stakeholders are not limited to those listed above, but these are the people, organizations, and businesses most directly impacted.

Analytical Approach

With the data available, the research and development team plans to look at the following data tables: Box Score Reports, Team Schedule, and Team Roster. For more information on these tables, see the data preparation phase. The majority of the analysis will delve into the Hoosiers’ box scores and schedule, with supporting information drawn from the roster. Most of the analysis will use descriptive analytics, which identifies what has already happened. The analyst will examine the schedule and historical box scores to gain insight into performance patterns of players and the team in two-year spans. The analyst will compare game-to-game performances of the team and consolidate these statistics for each coach (over their respective seasons as head coach) across the last four years. The focus will be on 5-7 performance variables, identified by their importance to game outcome. All variables included in this study are explained in more detail in the data understanding phase. By narrowing the focus, the analyst will gain a more thorough and complete understanding of performance in the areas of the game deemed most important. While descriptive analytics is not the only type of analytics that could be used, it will be the most insightful for the discoveries the analyst needs to make.

Expected Benefits

The analysis should provide a foundation for the athletic director to confidently state whether the current coach is a good candidate to continue moving the program in the right direction. There is no guarantee that the analysis will bring national or conference titles to the Hoosiers, as individual game outcomes can be highly unpredictable. However, general marks of improvement should provide evidence of a more competitive and well-rounded team, that is, a team that reflects the program’s historical attributes. The analyst looks to discover the strengths currently in the program and to identify weaknesses so that the coach can work on them until they are no longer a large detriment to the team. By identifying the details of the program, the staff and players will gain a greater, more objective understanding of themselves that will promote further competition and success. Other parts of the university will benefit as well. The athletic department will benefit by continuing to put Indiana University on the map as a strong athletic program and by drawing in potential recruits for basketball as well as other sports. The trickle-down effect of strong athletic programs will likely bring more students to campus due to the high interest in sports across the country. This will affect both the admissions office and the faculty and staff at Indiana University. The trickle-down effect would continue, as population growth in Bloomington will likely have a positive effect on the economy (more people visiting, more people spending, etc.). Finally, the NBA could benefit by having player performance information on potential draft picks from the university. Of course, if this proves to be successful at Indiana University, other colleges would want to be a part of it too.

Data Preparation

Origins of the Data Set

This NCAA Men’s Hoops data can be found at the hyperlink here.

NCAAhoopR is an R package for working with NCAA basketball play-by-play data. It automatically scrapes play-by-play data and returns it to the user in a tidier, more organized, and more concise format. This gives an analyst the ability to analyze the data in the ways that best suit the research being conducted. The data has no specific original purpose other than to be readily available for basketball analytics. The men’s college basketball data spans from the 2005-2006 season to the most current 2022-2023 season.

Description of the Data Set

Box Scores- Includes the game recorded box-scores from each game during the season. Includes information on individual performances as well as team totals for the game(s). Variables include player name, minutes played, field goal information, three-point shot information, free throw information, rebounds, assists, steals, blocks, turnovers, personal fouls, overall points, and whether the player was a starter or not. This table will likely be the “core” table for the analysis, as it hosts the easily and readily available data on the game.
Pbp_logs- Includes play-by-play information on each game recorded. This table includes the time of the play occurrence, description of the play, score, win probability throughout the game, time outs remaining, and information on who has possession of the ball.
Rosters- Team rosters for each season. Includes information on each player. These consist of position, class, height, weight, and hometown/home state.
Schedules- A team’s schedule for a given season. Displays information on the teams played, the dates they were played, as well as location, score, and the record of the team of interest (both overall and conference).

Note: These descriptions are geared more toward a general description. More specific details about the tables in terms of the analysis to be done will be discussed in the tabs below.

Data Understanding

The analysis team will only be using data sets from box_scores, rosters, and schedules for the Indiana basketball team. The data will be combined across four seasons spanning from 2019-2020 through the current 2022-2023 season. Information regarding each of the tables being used is included below. This also includes information on all of the variables present in the tables along with a description:

Indiana Rosters

10 columns x 67 rows
Support table
Will be used to gather general information on the team. This will include things like number of players on the team, number of players per position, average height across positions, and geographical origins of players on the team.
This table does not include a target variable but will still provide insight on the men’s basketball team and useful information for the analysis to be conducted.

Indiana Schedules

10 columns x 125 rows
Support table
Will be used to compare overall performance (in relation to W/L) across both head coaches. Used to see if there are any correlations between the opponents faced and overall team performance by coach. Point differential will also be derived from this table.
This table has a target variable, outcome. Since a coach’s job is to win, this will give the analyst an overarching view of how successful each coach was in the two-year time span.

Indiana Box Scores

30 columns x 1,343 rows
Core table
Will be used to derive most of the information on program performance during games. This table will provide the team with the most actionable analysis and results. Mostly statistic-based answers will be derived from this table.
There are a lot of potential target variables labeled, but as stated in the introduction the team will look to narrow this down to about 5-7 of interest. All of the variables have been included in the data understanding so that there is an understanding of what the team had available to them. Understand that once variables are picked the reasoning for it will be included.

Data Cleaning

First, all of the data sets were read into R.

library(readxl)

IU_Rosters <- read_excel("/Users/kamriefoster/Downloads/Indiana Rosters.xlsx")
IU_Schedule <- read_excel("/Users/kamriefoster/Downloads/Indiana Schedule.xlsx")
IU_BoxScores <- read_excel("/Users/kamriefoster/Downloads/Indiana Box Scores.xlsx")

Most of the data cleaning steps were done in Excel before the data was imported. These cleaning steps are listed for each data table in the tabs below.

Note: there may be additional data manipulation when analysis begins. Those steps are not included here, as each was derived for a specific purpose that will be outlined where it occurs. The steps below were done to clean the data set as a whole and serve as the starting point for any manipulation that happens from this point forward.

Indiana Rosters

Added the variable season to the data set. This was done to distinguish the year-to-year rosters on the master roster list for this analysis project.
Added the playerid variable and generated a unique number for each player on the roster. This allows player information to stay private if the analysis is shared, since results can be reported without identifying the individual they pertain to.
The height variable was originally stored in a feet’ inches” format. This was changed to an inches-only format for easier use in R when finding average height.
The hometown column was originally stored in a hometown, home state format. This column was separated into hometown and home state columns for geographical visualizations for recruiting purposes.
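
The roster steps above were carried out in Excel, but they could also be reproduced in R. The following is a minimal base R sketch under stated assumptions: the per-season data frames, the raw string formats, and the helper name height_to_inches are hypothetical illustrations, and the 2021 row is made-up example data rather than an actual roster entry.

```r
# Hypothetical raw rosters, one data frame per season.
# The 2020 rows mirror players shown in the roster output in this section;
# the 2021 row is illustrative only.
roster_2020 <- data.frame(
  name = c("Al Durham", "Trayce Jackson-Davis"),
  height = c("6'4\"", "6'9\""),
  hometown = c("Lilburn, GA", "Greenwood, IN"),
  stringsAsFactors = FALSE
)
roster_2021 <- data.frame(
  name = "Example Player",
  height = "6'10\"",
  hometown = "Bloomington, IN",
  stringsAsFactors = FALSE
)

# Convert a feet'inches" string to total inches (helper name is an assumption).
height_to_inches <- function(h) {
  parts <- regmatches(h, gregexpr("[0-9]+", h))
  vapply(parts, function(p) as.numeric(p[1]) * 12 + as.numeric(p[2]), numeric(1))
}

# Tag each roster with its season, then stack them into one master roster.
roster_2020$season <- 2020
roster_2021$season <- 2021
rosters <- rbind(roster_2020, roster_2021)

# Store height as inches only.
rosters$height <- height_to_inches(rosters$height)

# Split "Town, ST" into separate hometown and homestate columns.
place <- strsplit(rosters$hometown, ",\\s*")
rosters$hometown <- vapply(place, `[`, character(1), 1)
rosters$homestate <- vapply(place, `[`, character(1), 2)

rosters
```

Doing these steps in code rather than by hand keeps the cleaning reproducible if another season is added later.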

After these steps were done in Excel, a few data cleaning steps occurred in R. There were two column removals. First, the column number was removed, as it only records the jersey number each player wore and provides no insight for this analysis. Then, the created variable player_id was removed. Although including the variable only to remove it may seem like extra work, the analyst did not know at the outset whether the athletic department would share the findings and want to uphold player confidentiality. If player confidentiality is the goal, player_id should be kept and the name column removed instead. These two removals are displayed below:

IU_Rosters <- IU_Rosters[,-2] #removal of number
IU_Rosters <- IU_Rosters[,-2] #removal of player_id
IU_Rosters

## # A tibble: 67 × 8
##    season name                 position height weight class hometown   homestate
##     <dbl> <chr>                <chr>     <dbl>  <dbl> <chr> <chr>      <chr>    
##  1   2020 Cooper Bybee         G            73    185 JR    Ellettsvi… IN       
##  2   2020 Al Durham            G            76    185 JR    Lilburn    GA       
##  3   2020 Armaan Franklin      G            76    195 FR    Indianapo… IN       
##  4   2020 Justin Smith         F            79    230 JR    Buffalo G… IL       
##  5   2020 Trayce Jackson-Davis F            81    245 FR    Greenwood  IN       
##  6   2020 Michael Shipp        G            75    185 FR    Cincinnati OH       
##  7   2020 Rob Phinisee         G            73    190 SO    Lafayette  IN       
##  8   2020 Devonte Green        G            75    185 SR    North Bab… NY       
##  9   2020 Nathan Childress     F            78    210 FR    Zionsville IN       
## 10   2020 Adrian Chapman       G            74    190 SR    Brownsburg IN       
## # … with 57 more rows

Indiana Schedules

Added the variable season to the data set. This was done to distinguish the year-to-year schedules on the master schedule list for this analysis project.
Created calculated field score_dif to represent the score differential at the conclusion of the game. Score differential is the Hoosiers’ score minus their opponent’s score. A positive number means that the Hoosiers beat their opponent by that many points, while a negative number means that the Hoosiers lost to that opponent by that many points.
Split record column into the overall record and a conference record variable. Since these were both included in the record column, the team decided it would be easier to split them up and have the information contained in its own column.
Added the variable outcome to easily be able to count and distinguish the games that were won and the games that were lost by the Hoosiers.
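
The schedule steps above could be sketched in R as follows. This is an illustration only: the column names team_score, opp_score, and record, the record format "overall (conference)", the "W"/"L" coding of outcome, and the three games themselves are all assumptions, not the actual schedule data.

```r
# Hypothetical schedule rows; scores and records are made-up examples.
schedule <- data.frame(
  opponent = c("Opponent A", "Opponent B", "Opponent C"),
  team_score = c(80, 62, 71),
  opp_score = c(65, 70, 68),
  record = c("1-0 (0-0)", "1-1 (0-1)", "2-1 (1-1)"),
  stringsAsFactors = FALSE
)

# Tag the season so yearly schedules can be stacked later.
schedule$season <- 2023

# Score differential: Hoosiers' score minus the opponent's score.
schedule$score_dif <- schedule$team_score - schedule$opp_score

# Outcome: a positive differential is a win, otherwise a loss.
schedule$outcome <- ifelse(schedule$score_dif > 0, "W", "L")

# Split "overall (conference)" into two separate columns.
schedule$overall_record <- sub("\\s*\\(.*$", "", schedule$record)
schedule$conf_record <- sub("^.*\\((.*)\\).*$", "\\1", schedule$record)

# Winning percentage follows directly from the outcome column.
win_pct <- mean(schedule$outcome == "W")

schedule
```

Because outcome is derived from score_dif, the two columns cannot disagree, which is harder to guarantee when both are entered by hand.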

Indiana Box Scores

Added the variable season to the data set. This was done to distinguish the year-to-year box scores on the master box score list for this analysis project.
Added game_id to this data table to have the ability to link it to the schedules if desired.
Added playerid to this data table to have the ability to link it to the roster if desired.
Added outcome to the data table to denote the games that the Hoosiers won and the games that the Hoosiers lost.
The player column was stored as first initial and last name (e.g., J. Brunk); this was changed to match the roster and include the full name. This was done for consistency within the data.
Altered the column starter to be a true binary variable, where true is now represented by 1 and false is represented by 0. This will increase the usability of the column in R and/or Python.
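
The name matching and starter recoding above could be approximated in R. The sketch below assumes the box score names follow a "F. Lastname" pattern and that starter is stored as a logical; the data frames and the short key column are hypothetical, with player names borrowed from the roster output shown earlier.

```r
# Roster with full names (names taken from the roster output above).
roster <- data.frame(
  name = c("Al Durham", "Rob Phinisee", "Trayce Jackson-Davis"),
  stringsAsFactors = FALSE
)

# Hypothetical box score rows using the abbreviated name format.
box <- data.frame(
  player = c("T. Jackson-Davis", "A. Durham", "R. Phinisee"),
  starter = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Build a "F. Lastname" key from each full name on the roster.
roster$short <- sub("^(.)[^ ]* +(.*)$", "\\1. \\2", roster$name)

# Replace each abbreviated name with the matching full name.
box$player <- roster$name[match(box$player, roster$short)]

# Recode starter from TRUE/FALSE to 1/0.
box$starter <- as.integer(box$starter)

box
```

One caveat with this approach: two players sharing a first initial and last name would produce the same key, and match() would return only the first of them, so such cases would need to be resolved by hand.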

The analyst also created a few calculated fields:

Created calculated field POSS using the formula:
Possessions = FGA – OREB + TOV +(0.4 x FTA)
This value was then rounded to a whole number, since a possession cannot be a fraction. In addition, negative values were changed to zero, since basketball does not have negative possessions. This variable is an estimate, but research has found it to be a good approximation. By using possessions as the marker, comparisons across seasons and even between players can be more standardized.
Although they were not created in R, the shooting percentages should still be noted in the data cleaning section.
Created calculated fields FGP, ThreePTP, and FTP using the ratio of the number made divided by the total number of attempts.
FG% = (FGM / FGA) x 100
3PT% = (3PTM / 3PTA) x 100
FT% = (FTM / FTA) x 100

More details and explanation of these variables and their creation will be included in the analysis portion in which they are used. Note that these could not be produced accurately in R, as undefined values gave the analyst NAs and produced inaccuracies.
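
For reference, one way the calculated fields might be produced in R while handling the undefined values is sketched below. This is not the approach used in this project (the fields were created in Excel); the example rows and the helper name pct are assumptions, and returning NA when there are no attempts is one possible design choice among several.

```r
# Hypothetical player-game rows, including a row with no attempts
# and a row whose raw possession estimate is negative.
box <- data.frame(
  FGM = c(5, 0, 0), FGA = c(11, 0, 1),
  FTM = c(3, 0, 0), FTA = c(4, 0, 0),
  OREB = c(2, 0, 3), TOV = c(1, 0, 0)
)

# Possessions = FGA - OREB + TOV + (0.4 x FTA),
# rounded to a whole number and floored at zero.
poss_raw <- box$FGA - box$OREB + box$TOV + 0.4 * box$FTA
box$POSS <- pmax(round(poss_raw), 0)

# Percentage helper (name is an assumption): returns NA when there
# are no attempts instead of the NaN that 0 / 0 would produce.
pct <- function(made, attempts) {
  ifelse(attempts > 0, made / attempts * 100, NA_real_)
}

box$FGP <- pct(box$FGM, box$FGA)
box$FTP <- pct(box$FTM, box$FTA)

box
```

With explicit NA values, later summaries can pass na.rm = TRUE so that rows with no attempts are skipped rather than counted as 0 percent.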

After these steps were done in Excel, the analyst then removed some variables, as was done in the Roster data set. First, the analyst removed the variable team from the data set. This variable would be useful if the analysis covered more than one team, but with Indiana being the only team it adds nothing. Then the analyst removed the game_id column. It was originally created because the analyst did not know whether the analysis would need to link data tables. However, given the time constraints of the project, the analyst decided not to do this, and the variable was removed. Similarly, as discussed in the rosters section, the variable player_id was deleted. Once again, if the athletic department wished to keep player information confidential, the code could be reconfigured to retain this column instead. For this particular analysis that was not necessary, and thus the analyst removed the player_id column. These removals are demonstrated in the code below:

IU_BoxScores <- IU_BoxScores[,-1] #removal of the variable team
IU_BoxScores <- IU_BoxScores[,-2] #removal of game_id
IU_BoxScores <- IU_BoxScores[,-2] #removal of player_id
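
The removals above depend on column position, so each removal shifts the index of the columns that follow. A name-based alternative is sketched below in base R; the toy data frame and its column order are assumptions for illustration, not the actual layout of IU_BoxScores.

```r
# Hypothetical one-row box score table; column order is assumed.
box <- data.frame(
  team = "Indiana", season = 2023, game_id = 1, player_id = 7,
  player = "Example Player", PTS = 10,
  stringsAsFactors = FALSE
)

# Drop columns by name so the result does not depend on column order.
drop_cols <- c("team", "game_id", "player_id")
box_clean <- box[, !(names(box) %in% drop_cols), drop = FALSE]

names(box_clean)
```

To keep player information confidential instead, the same pattern could drop the player column and keep player_id.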

Libraries Needed for Analysis

After data cleaning was accomplished, the analyst then loaded all of the libraries needed to complete the analysis. The libraries and the respective reasons for loading them are listed below:

library(readxl) - this package is used to read the data sets into R.
library(tidyverse) - this package is used to be able to manipulate the data. Can be used to filter, join, and group the data.
library(ggplot2) - used to be able to create bar graphs and other visualizations.
library(rpart) - used to be able to generate the classification tree.
library(rpart.plot) - used to create the visual of the classification tree.
library(randomForest) - used to generate the random forest model.
library(dplyr) - used to be able to filter and manipulate the data as needed.
library(cowplot) - used to enter pictures in a grid for easy comparison of models.
library(ggplot2) - used to enter pictures in a grid for easy comparison of models.
library(magick) - used to enter pictures in a grid for easy comparison of models.

Then, the libraries were actually loaded into R using the code below:

#libraries needed for analysis
library(readxl)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.4     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2)
library(rpart)
library(rpart.plot)
library(randomForest)

## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(dplyr)
library(cowplot)
library(ggplot2)
library(magick)

## Linking to ImageMagick 6.9.12.3
## Enabled features: cairo, fontconfig, freetype, heic, lcms, pango, raw, rsvg, webp
## Disabled features: fftw, ghostscript, x11

Roster Analysis

Roster Size

First, data sets containing the roster information for each individual season were created. This is done by filtering the data as shown below:

#looking at the IU Roster by each season
#recall that 2020 is the 2019-2020 season, 2021 is the 2020-2021 season, 
#2022 is the 2021-2022 season, and 2023 is the most recent 2022-2023 season.

IU_Roster2020 <- filter(IU_Rosters, season == 2020)
IU_Roster2021 <- filter(IU_Rosters, season == 2021)
IU_Roster2022 <- filter(IU_Rosters, season == 2022)
IU_Roster2023 <- filter(IU_Rosters, season == 2023)

The analyst wanted to start by getting an idea of the number of players the Hoosiers carry on their roster each season. The analyst set out to find whether the new head coach differed from the previous coach in the number of players on the roster per season.

count_players <- table(IU_Rosters$season)
barplot(count_players, main = "Size of Roster", xlab = "Season", ylim = c(0,20))

As shown in the figure above, the number of players on the team has stayed consistent over the last few years at 16 or 17. It is reasonable to assume that the roster will stay around the same size due to scholarships and NCAA regulations. In men’s college sports it is common for incoming freshmen to be redshirted in order to ease the transition from high school to college and/or to develop their athletic abilities. In some cases players may be redshirted to give them time to recover from a preseason injury. Carrying a roster of this size helps ensure that the coach will have enough players to make it through the season.

Players by Position

After evaluating the number of players on the team, the analyst then wanted to gather information on the number of players within each position. By focusing on position, important recruiting information can be derived. Bar charts have been created below to display the comparison of number of players within each position.

#used to show all four graphs in one visual for easy comparison
par(mfrow = c(2,2))

#the bar graphs are made below for each season
count_class2020 <- table(IU_Roster2020$position)
barplot(count_class2020, main = "2020 Position Count", xlab = "Position", ylim = c(0,12))

count_class2021 <- table(IU_Roster2021$position)
barplot(count_class2021, main = "2021 Position Count", xlab = "Position", ylim = c(0,12))

count_class2022 <- table(IU_Roster2022$position)
barplot(count_class2022, main = "2022 Position Count", xlab = "Position", ylim = c(0,12))

count_class2023 <- table(IU_Roster2023$position)
barplot(count_class2023, main = "2023 Position Count", xlab = "Position", ylim = c(0,12))

One immediate observation the analyst had was the lack of centers on the team. In the 2019-2020 season, the Hoosiers did not have a single player on the roster listed at the center position. Even in the remaining seasons, the Hoosiers consistently had a very low number of players listed as centers. This could largely be due to the fact that teams are looking for players who can be effective in the paint as well as on the perimeter. Specifically between the 2021-2022 and 2022-2023 seasons, it appears that Woodson placed an emphasis on recruiting more forwards to the team. In general, these charts show that regardless of coach the Hoosiers appear to have a larger number of opportunities available for guards and forwards.
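
The same position comparison could also be made from a single cross-tabulation rather than four separate tables. The sketch below uses made-up rows in place of IU_Rosters, so the counts are illustrative only; the season and position column names follow the roster output shown earlier.

```r
# Hypothetical roster rows; counts do not reflect the actual rosters.
rosters <- data.frame(
  season = c(2020, 2020, 2020, 2021, 2021, 2021),
  position = c("G", "G", "F", "G", "F", "C"),
  stringsAsFactors = FALSE
)

# Rows are positions, columns are seasons.
pos_counts <- table(rosters$position, rosters$season)
pos_counts

# Grouped bar chart: one cluster of bars per season.
barplot(pos_counts, beside = TRUE, legend.text = rownames(pos_counts),
        main = "Position Count by Season", xlab = "Season",
        ylab = "Players")
```

A cross-tabulation reports a position with no players in a season as an explicit zero, which makes a gap such as a missing center easy to spot.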

Geographical Information

To collect more general information on recruitment for the Indiana basketball team, the analyst then looked at where each player on the roster was from. The seasons in which Archie Miller was coach (2019-20 and 2020-21) were examined first, to see if there were any major differences once Mike Woodson took over (2021-22 and 2022-23). This was done in Tableau using the geographical feature. Screenshots for comparison have been included below:

Archie Miller’s recruiting years are displayed on the top, while Mike Woodson’s recruiting years are on the bottom. These diagrams show that over the past 4 seasons there have been players from all over the country. Both visuals also show a large number of recruits from the home state of Indiana. However, it does not appear that either coach was or is limiting recruitment to any particular region. In fact, it appears that Coach Mike Woodson has expanded his recruitment further than Archie Miller did, with player additions from Texas, Kansas, Missouri, and Virginia. Recruitment under Archie Miller appears to be more densely concentrated in Indiana and neighboring states. Mike Woodson still has many players from the same area, but the data points are clearly less dense in and around Indiana. In short, it appears that any Division I caliber athlete will have an opportunity to play in Bloomington regardless of where they are from. Specifically under Mike Woodson, there appears to be a larger effort to recruit top prospects even if they are across the country.
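
The visual comparison above was made in Tableau. A rough numeric counterpart could be computed in R by counting players by home state under each coach, as sketched below. The rows are made-up stand-ins for IU_Rosters and the coach column is a derived, hypothetical field, so the resulting shares are illustrative only.

```r
# Hypothetical roster rows; states do not reflect the actual rosters.
rosters <- data.frame(
  season = c(2020, 2021, 2021, 2022, 2023, 2023),
  homestate = c("IN", "IN", "OH", "IN", "TX", "VA"),
  stringsAsFactors = FALSE
)

# Label each season with its head coach (2020-2021 Miller, 2022-2023 Woodson).
rosters$coach <- ifelse(rosters$season <= 2021, "Miller", "Woodson")

# Count players by coach and home state.
state_counts <- table(rosters$coach, rosters$homestate)
state_counts

# Share of each coach's roster spots filled by in-state players.
in_share <- prop.table(state_counts, margin = 1)[, "IN"]
in_share
```

Note that a player who appears on several seasons' rosters is counted once per season here; removing duplicate players within each coach first would count individuals instead of roster spots.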

Player Height

Next, the analyst gathered more specific information about the team. In basketball, height is a huge component of the game. Taller teams have always had an advantage in shooting, blocking, and rebounding. With this in mind, the average height of the team under each coach was found.

First, filter the data for the seasons in which Archie Miller was the head coach. Recall that this data set includes only the past four seasons, so this filter keeps roster information from the 2019-20 and 2020-21 seasons only.

IU_RosterAM <- filter(IU_Rosters, season == "2020" | season == "2021")

Next, filter the data for the seasons in which Mike Woodson has been the head coach. This includes the 2021-22 season as well as the 2022-23 season.

IU_RosterMW <- filter(IU_Rosters, season == "2022" | season == "2023")

Next, look at the average height by position for each coach. Start by grouping the data by position:

AM_by_position <- group_by(IU_RosterAM, position)
MW_by_position <- group_by(IU_RosterMW, position)

Now find the average height by position:

AM_avg_height_by_pos <- summarise(AM_by_position, type = mean(height, na.rm = TRUE))
AM_avg_height_by_pos

## # A tibble: 3 × 2
##   position  type
##   <chr>    <dbl>
## 1 C         83  
## 2 F         79.8
## 3 G         74.8

MW_avg_height_by_pos <- summarise(MW_by_position, type = mean(height, na.rm = TRUE))
MW_avg_height_by_pos

## # A tibble: 3 × 2
##   position  type
##   <chr>    <dbl>
## 1 C         82.3
## 2 F         79.4
## 3 G         76

As seen above, in terms of height, there is very little difference in recruiting between the two coaches. The only position that changed by more than an inch was guard, where the average under Mike Woodson is just over an inch taller (74.8 to 76 inches). The averages for centers and forwards each dropped by less than an inch. Understand that the data is limited, and with more years for each coach a larger difference could be seen. However, with the data available, the following observations can be made for the Hoosiers over the last four seasons:

Centers are the tallest players on the team at roughly 6’10” to 6’11”.
Forwards average between 6’7” and 6’8” under both Archie Miller and Mike Woodson.
Guards are the smallest players on the team, averaging about 6’3” with Archie Miller and about 6’4” with Mike Woodson.
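
The inch-to-feet conversion behind these bullets can be checked with a short helper. The following is a minimal sketch: `to_feet_inches` is a hypothetical helper name (it is not part of the original analysis), and the input values are copied from the two tibbles printed above.

```r
# Hypothetical helper: convert a height in inches to a feet'inches" label,
# rounding to the nearest whole inch first.
to_feet_inches <- function(inches) {
  total <- round(inches)
  paste0(total %/% 12, "'", total %% 12, "\"")
}

# Average heights (in inches) copied from the tibbles printed above
avg_height <- data.frame(
  position = c("C", "F", "G"),
  miller   = c(83, 79.8, 74.8),
  woodson  = c(82.3, 79.4, 76)
)

avg_height$miller_label  <- to_feet_inches(avg_height$miller)
avg_height$woodson_label <- to_feet_inches(avg_height$woodson)

# Change in average height (inches) from Miller's rosters to Woodson's
avg_height$change <- avg_height$woodson - avg_height$miller
avg_height
```

Rounding to the nearest inch is a presentation choice; the unrounded `change` column is the better basis for comparing the two coaches.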

Since recruitment has been similar over the past four seasons, a summary was generated to see both the smallest and the tallest players on the team. This allows the analyst to continue building a general recruiting profile that gives an idea of the Hoosiers’ preferred player size.

summary(IU_Rosters)

##      season         name             position             height     
##  Min.   :2020   Length:67          Length:67          Min.   :73.00  
##  1st Qu.:2021   Class :character   Class :character   1st Qu.:75.00  
##  Median :2022   Mode  :character   Mode  :character   Median :77.00  
##  Mean   :2022                                         Mean   :77.43  
##  3rd Qu.:2022                                         3rd Qu.:79.00  
##  Max.   :2023                                         Max.   :84.00  
##      weight         class             hometown          homestate        
##  Min.   :185.0   Length:67          Length:67          Length:67         
##  1st Qu.:189.0   Class :character   Class :character   Class :character  
##  Median :205.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :209.3                                                           
##  3rd Qu.:226.5                                                           
##  Max.   :255.0

The generated summary shows that the smallest player on the roster is listed at 6’1” (73 inches) and the tallest at 7’0” (84 inches). This suggests that a player’s chances of being recruited to Indiana University increase with height, because of the advantage it brings to the game. However, a smaller player could still be recruited if they demonstrate the appropriate skill level.

Schedule Analysis

Overall Win/Losses

After developing a good understanding of the roster similarities and differences for Coach Miller and Coach Woodson, the analyst looked into their overall success. For many coaches, success is measured by the number of wins and the number of losses. With this in mind, the analyst looked at each coach’s overall record (in their respective two seasons). This was done by using the accumulated schedules of the team over the past four seasons and comparing the total number of wins directly to the number of losses. Ideally, a team should have more wins than losses to be considered competitive. In addition, the athletic director hopes to see an increase in the number of overall wins in Mike Woodson’s first two seasons, which would indicate that the program is becoming more successful under his direction.

First, start by recreating the respective schedules for each coach.

IU_ScheduleAM <- filter(IU_Schedule, season == "2020" | season == "2021")
IU_ScheduleAM

## # A tibble: 59 × 12
##    season   game_id date                opponent   location team_score opp_score
##     <dbl>     <dbl> <dttm>              <chr>      <chr>         <dbl>     <dbl>
##  1   2020 401166018 2019-11-05 00:00:00 Western I… H                98        65
##  2   2020 401166053 2019-11-09 00:00:00 Portland … H                85        74
##  3   2020 401166059 2019-11-12 00:00:00 North Ala… H                91        65
##  4   2020 401166073 2019-11-16 00:00:00 Troy       H               100        62
##  5   2020 401166082 2019-11-20 00:00:00 Princeton  H                79        54
##  6   2020 401166096 2019-11-25 00:00:00 Louisiana… H                88        75
##  7   2020 401166102 2019-11-30 00:00:00 South Dak… H                64        50
##  8   2020 401168234 2019-12-03 00:00:00 Florida S… H                80        64
##  9   2020 401166106 2019-12-07 00:00:00 Wisconsin  A                64        84
## 10   2020 401169461 2019-12-10 00:00:00 UConn      N                57        54
## # … with 49 more rows, and 5 more variables: score_dif <dbl>, outcome <chr>,
## #   record <chr>, `conference record` <chr>, streak <chr>

IU_ScheduleMW <- filter(IU_Schedule, season == "2022" | season == "2023")

Now, generate bar graphs to compare the two coaches’ overall records.

#used to show both graphs in one visual for easy comparison
par(mfrow = c(1,2))

#compute the counts for the number of wins and losses under Archie Miller
count_outcomesAM <- table(IU_ScheduleAM$outcome)
#compute the counts for the number of wins and losses under Mike Woodson
count_outcomesMW <- table(IU_ScheduleMW$outcome)

barplot(count_outcomesAM, main = "Coach Miller Overall Wins versus Losses", ylim = c(0,50))
barplot(count_outcomesMW, main = "Coach Woodson Overall Wins versus Losses", ylim = c(0,50))

While both coaches posted a winning percentage over 50% across their two seasons, the visual shows that Mike Woodson has been more successful in his first two seasons than Archie Miller was in his last two seasons. To answer the question “How much more successful?” the analyst looked at the actual counts for both coaches.

#win/loss total for Archie Miller's past two seasons
count_outcomesAM

## 
##  L  W 
## 27 32

#win/loss total for Mike Woodson's first two seasons 
count_outcomesMW

## 
##  L  W 
## 24 42

Archie Miller’s record was 32-27, a winning percentage of 54.2%. Mike Woodson improved the Hoosiers’ record to 42-24 over the last two seasons, a winning percentage of 63.6%. That is an increase of almost 10 percentage points in winning percentage. This is a promising sign that, with Mike Woodson as head coach, the Hoosiers are headed in the right direction.
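
The percentages quoted here can be reproduced directly from the counts. This is a small sketch using the win/loss totals printed above; the `records` data frame and its column names are illustrative and not part of the original script.

```r
# Win/loss counts copied from the table() output printed above
records <- data.frame(
  coach  = c("Miller", "Woodson"),
  wins   = c(32, 42),
  losses = c(27, 24)
)

# Winning percentage = wins / games played
records$games   <- records$wins + records$losses
records$win_pct <- round(100 * records$wins / records$games, 1)
records

# Difference between the two coaches, in percentage points
diff(records$win_pct)
```

Reporting the gap in percentage points (rather than as a percent change in wins) keeps the comparison fair, since the two coaches played a different number of games.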

Win/Loss Record by Season

The analyst then wanted to look at how the Hoosiers did in each season of play. While the overall record shows that the 2021-22 and 2022-23 seasons were better overall than the 2019-20 and 2020-21 seasons, analyzing by season allows the analyst to see whether the records are improving or declining with a particular coach leading the team. Start by filtering the schedule data by each given season.

IU_Schedule2020 <- filter(IU_Schedule, season == 2020)
IU_Schedule2021 <- filter(IU_Schedule, season == 2021)
IU_Schedule2022 <- filter(IU_Schedule, season == 2022)
IU_Schedule2023 <- filter(IU_Schedule, season == 2023)

Now, create the bar charts to compare the number of wins and losses within each season of the four-season period.

#used to show all four graphs in one visual for easy comparison
par(mfrow = c(2,2))

#the bar graphs are made below for each season
count_outcome2020 <- table(IU_Schedule2020$outcome)
barplot(count_outcome2020, main = "2019-20 Outcome Totals", xlab = "Outcome", ylim = c(0,25))

count_outcome2021 <- table(IU_Schedule2021$outcome)
barplot(count_outcome2021, main = "2020-21 Outcome Totals", xlab = "Outcome", ylim = c(0,25))

count_outcome2022 <- table(IU_Schedule2022$outcome)
barplot(count_outcome2022, main = "2021-22 Outcome Totals", xlab = "Outcome", ylim = c(0,25))

count_outcome2023 <- table(IU_Schedule2023$outcome)
barplot(count_outcome2023, main = "2022-23 Outcome Totals", xlab = "Outcome", ylim = c(0,25))

In the above visual, the two seasons that Archie Miller was coach are displayed on top and the two seasons that Mike Woodson was coach are displayed on the bottom. Observations from the bar graphs include:

The 2020-21 season was the worst season the Hoosiers have had in the past four seasons. It is the only season with a losing record and was the last season that Archie Miller was the coach for Indiana.
The 2021-22 and 2022-23 seasons display a nearly identical number of wins; however, the 2022-23 season clearly shows a decrease in the total number of losses.
The 2019-20 and 2021-22 seasons display very similar overall outcomes. However, the 2022-23 season posted by far the best results for the team.

Similar to before, the analyst wanted to compare win percentages across the seasons:

count_outcome2020

## 
##  L  W 
## 12 20

count_outcome2021

## 
##  L  W 
## 15 12

count_outcome2022

## 
##  L  W 
## 14 21

count_outcome2023

## 
##  L  W 
## 10 21

This information was placed in a table for easy comparison and organization.

In the data set that the analyst was working with, Archie Miller’s first year (62.5%) was better than Mike Woodson’s (60%). However, the opposite was true of the following year. Miller’s winning percentage fell by 18.1 percentage points, while Woodson’s rose by 7.7. Although Mike Woodson’s first year as head coach was not as successful (in terms of winning percentage) as Archie Miller’s first year in this data set, Mike Woodson took over after the Hoosiers’ worst season (44.4%) and raised the winning percentage by 15.6 percentage points to 60%. He continued to push the program this past season with a 67.7% winning percentage. This beats every other season in the data set by at least 5 percentage points. As such, there is evidence that Woodson has taken steps to move the basketball program in the right direction.
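
The season-by-season percentages and their year-over-year changes can be recomputed from the four `table()` outputs. The sketch below rebuilds that comparison; the `seasons` data frame and its column names are illustrative, and the counts are copied from the output printed above.

```r
# Season-level counts copied from the table() outputs printed above
seasons <- data.frame(
  season = c("2019-20", "2020-21", "2021-22", "2022-23"),
  coach  = c("Miller", "Miller", "Woodson", "Woodson"),
  wins   = c(20, 12, 21, 21),
  losses = c(12, 15, 14, 10)
)

# Winning percentage per season, rounded to one decimal place
seasons$win_pct <- round(100 * seasons$wins / (seasons$wins + seasons$losses), 1)

# Change from the previous season in percentage points (NA for the first season)
seasons$change <- c(NA, diff(seasons$win_pct))
seasons
```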

Overall Win/Losses by Location

Win/Losses by Location

After gathering general information on seasonal wins and losses for each coach, the analyst aimed to find trends in outcomes for the team in relation to the location of the game. As basketball is one of the main contributors to athletics revenue, it is important to fill the arena with fans in order to sell tickets. Also, as nearly half of the games are played at other locations, the analyst wanted to see whether both coaches could be competitive when playing at another school. It is important to remember that game outcome is the underlying measure of success for the team, but other components of the game should be considered before drawing any firm conclusion about the coaches. A team can play a “good” game and still not have the outcome go their way. First, start by filtering the data accordingly:

#filtering the data by location for Archie Miller
IU_ScheduleAM_H <- filter(IU_ScheduleAM, location == "H")
IU_ScheduleAM_A <- filter(IU_ScheduleAM, location == "A")
IU_ScheduleAM_N <- filter(IU_ScheduleAM, location == "N")

#filtering the data by location for Mike Woodson
IU_ScheduleMW_H <- filter(IU_ScheduleMW, location == "H")
IU_ScheduleMW_A <- filter(IU_ScheduleMW, location == "A")
IU_ScheduleMW_N <- filter(IU_ScheduleMW, location == "N")

Now, display the differences in game outcome by location for each coach.

#used to show all six graphs in one visual for easy comparison
par(mfrow = c(3,2))

#the bar graphs for home games 
count_outcomeAM_H <- table(IU_ScheduleAM_H$outcome)
barplot(count_outcomeAM_H, main = "Coach Miller Home Game Outcome Totals", xlab = "Outcome", ylim = c(0,30))

count_outcomeMW_H <- table(IU_ScheduleMW_H$outcome)
barplot(count_outcomeMW_H, main = "Coach Woodson Home Game Outcome Totals", xlab = "Outcome", ylim = c(0,30))

#the bar graphs for away games 
count_outcomeAM_A <- table(IU_ScheduleAM_A$outcome)
barplot(count_outcomeAM_A, main = "Coach Miller Away Game Outcome Totals", xlab = "Outcome", ylim = c(0,30))

count_outcomeMW_A <- table(IU_ScheduleMW_A$outcome)
barplot(count_outcomeMW_A, main = "Coach Woodson Away Game Outcome Totals", xlab = "Outcome", ylim = c(0,30))

#the bar graphs for neutral games 
count_outcomeAM_N <- table(IU_ScheduleAM_N$outcome)
barplot(count_outcomeAM_N, main = "Coach Miller Neutral Game Outcome Totals", xlab = "Outcome", ylim = c(0,30))

count_outcomeMW_N <- table(IU_ScheduleMW_N$outcome)
barplot(count_outcomeMW_N, main = "Coach Woodson Neutral Game Outcome Totals", xlab = "Outcome", ylim = c(0,30))

From the visual the following observations can be made:

Both coaches have a better record at home than at any other location. However, over the course of two seasons Coach Woodson has more wins and fewer losses at home than Archie Miller did in his two seasons.
Both coaches struggle to win away games. The number of away losses is identical for Woodson and Miller; however, the number of away wins achieved by Woodson in his two seasons surpasses the number achieved by Miller in his.
Neutral games are much harder to draw conclusions on as the number of games played at these locations the past four years are very limited. However, both coaches appear to post winning records.
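
A more compact alternative to six separate filters is a single two-way contingency table. The sketch below uses only the ten games visible in the schedule preview shown earlier, so the counts are a small subset rather than the full two-season totals. It also assumes that the data set's `outcome` column is "W" whenever `team_score` exceeds `opp_score`.

```r
# First ten games of the 2019-20 schedule, copied from the tibble preview shown earlier
games <- data.frame(
  location   = c("H", "H", "H", "H", "H", "H", "H", "H", "A", "N"),
  team_score = c(98, 85, 91, 100, 79, 88, 64, 80, 64, 57),
  opp_score  = c(65, 74, 65, 62, 54, 75, 50, 64, 84, 54)
)

# Derive the outcome from the scores (assumed to match the outcome column)
games$outcome <- ifelse(games$team_score > games$opp_score, "W", "L")

# One contingency table replaces the separate per-location filters
outcome_by_location <- table(games$location, games$outcome)
outcome_by_location

# Row-wise proportions give the win rate at each location
round(prop.table(outcome_by_location, margin = 1), 2)
```

With the full data, the same call could be applied to `IU_ScheduleAM` and `IU_ScheduleMW` to get each coach's home, away, and neutral records in one table.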

Seasonal Win/Losses by Location

After drawing conclusions about location overall for each coach, the analyst looked at the coaches’ individual seasons to compare them against each other. Note that neutral site games have been removed, since there isn’t enough data on hand. The first comparison covers the two years under Archie Miller. Start by filtering the data:

#2019-2020 Season
IU_Schedule2020_H <- filter(IU_Schedule2020, location == "H")
IU_Schedule2020_A <- filter(IU_Schedule2020, location == "A")

#2020-2021 Season
IU_Schedule2021_H <- filter(IU_Schedule2021, location == "H")
IU_Schedule2021_A <- filter(IU_Schedule2021, location == "A")

Now, create the bar graphs to compare outcomes at each location by season:

#used to show all four graphs in one visual for easy comparison
par(mfrow = c(2,2))

#bar graphs for home games 
count_outcome2020_H <- table(IU_Schedule2020_H$outcome)
barplot(count_outcome2020_H, main = "2019-20 Home Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))

count_outcome2021_H <- table(IU_Schedule2021_H$outcome)
barplot(count_outcome2021_H, main = "2020-21 Home Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))

#bar graphs for away games 
count_outcome2020_A <- table(IU_Schedule2020_A$outcome)
barplot(count_outcome2020_A, main = "2019-20 Away Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))

count_outcome2021_A <- table(IU_Schedule2021_A$outcome)
barplot(count_outcome2021_A, main = "2020-21 Away Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))

Observations from the bar graphs include:

A large decrease in the number of wins at home games from the 2019-20 season to the 2020-21 season. The Hoosiers went from winning nearly three times the number of games they lost at home to having as many wins as losses in the 2020-21 season.
Away games appeared to be consistent in both the number of wins and losses over each season. Note that the number of away losses in 2020-21 is lower than in the prior season. This marks an improvement for Archie Miller, but not by much.

In general, the 2020-21 season under Archie Miller appeared to be a struggle for the Hoosiers regardless of location.

Next, the analyst looked to do the same to evaluate Mike Woodson. Start by filtering the data as necessary:

#2021-2022 Season
IU_Schedule2022_H <- filter(IU_Schedule2022, location == "H")
IU_Schedule2022_A <- filter(IU_Schedule2022, location == "A")

#2022-2023 Season
IU_Schedule2023_H <- filter(IU_Schedule2023, location == "H")
IU_Schedule2023_A <- filter(IU_Schedule2023, location == "A")

Now, generate the bar graphs for easy visual comparison:

#used to show all four graphs in one visual for easy comparison
par(mfrow = c(2,2))

#bar graphs for home games 
count_outcome2022_H <- table(IU_Schedule2022_H$outcome)
barplot(count_outcome2022_H, main = "2021-22 Home Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))

count_outcome2023_H <- table(IU_Schedule2023_H$outcome)
barplot(count_outcome2023_H, main = "2022-23 Home Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))

#bar graphs for away games 
count_outcome2022_A <- table(IU_Schedule2022_A$outcome)
barplot(count_outcome2022_A, main = "2021-22 Away Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))

count_outcome2023_A <- table(IU_Schedule2023_A$outcome)
barplot(count_outcome2023_A, main = "2022-23 Away Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))

Observations from the bar graphs include:

There was a slight increase in the number of wins and a slight decrease in the number of losses from the 2021-22 season to the 2022-23 season for home games for the Hoosiers. While the difference is only a game or two, it still marks improvement for the team across seasons.
Away games still contribute more losses than wins for Coach Woodson; however, the gap between the number of losses and wins has narrowed over the past two seasons, and the team is becoming more successful on the road.

In general, the Hoosiers have seen improvement under the guidance of Mike Woodson regardless of game location. They are being effective and finding more ways to win on the road.

Score Average Comparison

After looking at general location information between the coaches, the analyst looked to compare mean scoring information over the course of the two seasons for each coach. First, start by getting the overall averages for Archie Miller.

mean(IU_ScheduleAM$team_score)

## [1] 70.64407

mean(IU_ScheduleAM$opp_score)

## [1] 67.9322

mean(IU_ScheduleAM$score_dif)

## [1] 2.711864

Now, find the averages for each season in which Archie Miller was coach during the past four seasons.

aggregate(cbind(team_score, opp_score, score_dif) ~ season, data = IU_ScheduleAM, FUN = mean, na.rm = TRUE)

##   season team_score opp_score score_dif
## 1   2020    71.4375  66.71875 4.7187500
## 2   2021    69.7037  69.37037 0.3333333

Now, the same will be done for Mike Woodson starting with his overall averages.

mean(IU_ScheduleMW$team_score)

## [1] 72.89394

mean(IU_ScheduleMW$opp_score)

## [1] 67.24242

mean(IU_ScheduleMW$score_dif)

## [1] 5.651515

Then, find the averages for each season in which Mike Woodson was coach during the past four seasons.

aggregate(cbind(team_score, opp_score, score_dif) ~ season, data = IU_ScheduleMW, FUN = mean, na.rm = TRUE)

##   season team_score opp_score score_dif
## 1   2022   70.80000  66.17143  4.628571
## 2   2023   75.25806  68.45161  6.806452

The data will now be presented in a table for easier reading and comparison.

Observations from this table include:

Both coaches hold positive score differentials, which means that on average their teams are outscoring their opponents. However, Mike Woodson’s teams outperform Archie Miller’s by about 3 points in average score differential (5.7 versus 2.7).
Archie Miller and Mike Woodson had nearly identical score differentials in their first years within this data set (4.7 and 4.6).
The largest difference occurs in each coach’s second year in the data set. Archie Miller’s team was not as successful, outscoring opponents by only 0.3 points on average. Mike Woodson’s team, in contrast, improved its score differential by over 2 points.

Once again, Mike Woodson appears to be leading the program in the right direction.
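
For reference, the averages reported above can be gathered into one comparison table. The following is a sketch under the assumption that `score_dif` is defined as `team_score` minus `opp_score`, which is consistent with the printed averages. The values are rounded copies of the `mean()` and `aggregate()` output, and the `score_summary` name is illustrative.

```r
# Averages copied from the mean() and aggregate() outputs printed above (rounded)
score_summary <- data.frame(
  coach      = c("Miller", "Miller", "Miller", "Woodson", "Woodson", "Woodson"),
  period     = c("Overall", "2019-20", "2020-21", "Overall", "2021-22", "2022-23"),
  team_score = c(70.64, 71.44, 69.70, 72.89, 70.80, 75.26),
  opp_score  = c(67.93, 66.72, 69.37, 67.24, 66.17, 68.45)
)

# Score differential is the team's average minus the opponents' average
score_summary$score_dif <- round(score_summary$team_score - score_summary$opp_score, 2)
score_summary
```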

Score Differential Time Series Information

Score Differential Time Series Information

After looking at the scoring averages, the analyst decided to look a little closer at score differential. While points scored is important, it is not very informative unless the other team’s score is taken into account. For example, if the Hoosiers score a lofty 100 points in a game, it would not mean much if the other team scored 101. So, the analyst looked to compare score differential trends for each coach over the course of the season. First, this was done by constructing a bar graph that averaged each coach’s two years by month (November through March).

p1 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/Miller Score Dif Averages.png", scale = 1)
p2 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/Woodson Score Dif Averages.png", scale = 1)

plot_grid(p1, p2, ncol = 2)

The following observations can be seen from the side by side comparisons of score differential by month:

Both coaches appear to start the season off well above their opponents in scoring. November appears to be the best month for the Hoosiers, as the score differential for both coaches is well above 20 points.
As the season proceeds, both coaches lose that separation between their points and their opponents’. December is still positive for both coaches; however, Coach Miller’s differential is much lower than Coach Woodson’s.
The largest difference comes in the month of January, where the average score differential for Miller was negative while Coach Woodson’s remained positive.
Both coaches appear to struggle equally in the month of February, which has the largest negative score differential for both.
While both coaches are still negative in score differential in March, Woodson appears to struggle more at the end of the season. This is a concern, as March is the most crucial part of the season, with the March Madness tournament and a chance for a national title.
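
The monthly averages behind these charts were built outside of R, but the same grouping can be sketched in base R. The example below uses only the ten games visible in the schedule preview shown earlier (all of November 2019 and the first three games of December 2019). It illustrates the calculation and does not reproduce the two-season averages in the charts. It assumes `score_dif` is `team_score` minus `opp_score`.

```r
# First ten games of the 2019-20 schedule, copied from the tibble preview shown earlier
games <- data.frame(
  date = as.Date(c("2019-11-05", "2019-11-09", "2019-11-12", "2019-11-16",
                   "2019-11-20", "2019-11-25", "2019-11-30", "2019-12-03",
                   "2019-12-07", "2019-12-10")),
  team_score = c(98, 85, 91, 100, 79, 88, 64, 80, 64, 57),
  opp_score  = c(65, 74, 65, 62, 54, 75, 50, 64, 84, 54)
)

# Score differential for each game
games$score_dif <- games$team_score - games$opp_score

# Month name for each game (month.name is a base R constant)
games$month <- month.name[as.integer(format(games$date, "%m"))]

# Average score differential by month
monthly_avg <- aggregate(score_dif ~ month, data = games, FUN = mean)
monthly_avg
```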

After looking at general trends, the analyst then looked to break the score differential down by each individual season to see if there were any identifiable trends. A chart displaying the time series information is shown below:

p1 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/2019-20 Season Dif.png", scale = 1)
p2 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/2020-21 Season Dif.png", scale = 1)
p3 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/2021-22 Season Dif.png", scale = 1)
p4 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/2022-23 Season Dif.png", scale = 1)

plot_grid(p1, p2, p3, p4, ncol = 2)

The time series for each season all show a similar descending trend line in score differential. This appears to be an ongoing problem for the team regardless of who is coaching. The analyst then looked at the point in each season where the trend line crosses zero (the intercept with the horizontal axis), to see whether it occurred later in the year in some seasons than in others:

2019-20 occurred at the beginning of February
2020-21 occurred in the middle of January
2021-22 occurred at the beginning of February
2022-23 occurred at the beginning of February

This appears to show that there really was no difference in the trend lines other than in the 2020-21 season. Thus, when comparing the two coaches, this does not appear to offer any meaningful differences in regard to score differential as the season progresses.
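
The crossing dates above were read from the charts. As an illustration of how such a crossing point could be computed, the sketch below fits a linear trend line with `lm()` and solves for the day on which the fitted line reaches zero. The data are hypothetical values invented for illustration only; they are not taken from the Indiana schedule.

```r
# Hypothetical season: weekly game dates and score differentials (illustrative values only)
season <- data.frame(
  date = as.Date("2022-11-07") + seq(0, 119, by = 7),
  score_dif = c(24, 18, 21, 15, 12, 14, 9, 6, 8, 3, 1, -2, 2, -4, -6, -3, -8, -10)
)

# Days elapsed since the first game, used as the numeric predictor
season$day <- as.numeric(season$date - min(season$date))

# Linear trend line: score_dif = intercept + slope * day
fit <- lm(score_dif ~ day, data = season)
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["day"]]

# The trend line reaches zero where intercept + slope * day = 0
crossing_day  <- -intercept / slope
crossing_date <- min(season$date) + round(crossing_day)
crossing_date
```

A later crossing date would indicate that the team kept a positive average margin deeper into the season.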

Specific Opponent Information

After looking at score differential information, the analyst then wanted to compare the opponents played under each coach. The comparison is fairer if there is a control variable. Thus, if the analyst finds common opponents, the results for the two coaches will be more directly comparable.

#used to show both graphs in one visual for easy comparison
par(mfrow = c(1,2))

count_opponentsAM <- table(IU_ScheduleAM$opponent)
barplot(count_opponentsAM, las = 2, cex.names = 0.3, ylab = "Number of Times Played", main = "Archie Miller Opponents")

count_opponentsMW <- table(IU_ScheduleMW$opponent)
barplot(count_opponentsMW, las = 2, cex.names = 0.3, ylab = "Number of Times Played", main = "Mike Woodson Opponents")

The visuals above show that the common opponents between the two coaches are conference opponents. Thus, it may be necessary to filter the data to include only games against teams within the Big Ten Conference. These teams include:

Illinois
Iowa
Maryland
Michigan
Michigan State
Minnesota
Nebraska
Northwestern
Ohio State
Penn State
Purdue
Rutgers
Wisconsin

Conference Game Outcomes

With a control group in mind, the analyst then wanted to look at each in-conference team played and derive specific information for each coach. First, filter the data to only include information on the Big Ten teams for each coach.

#Big Ten opponents used as the common (control) group for both coaches
big_ten_teams <- c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State", "Minnesota", "Nebraska", "Northwestern", "Ohio State", "Penn State", "Purdue", "Rutgers", "Wisconsin")

#Archie Miller data by in-conference teams
IU_ScheduleAM_conf <- filter(IU_ScheduleAM, opponent %in% big_ten_teams)

#Mike Woodson data by in-conference teams
IU_ScheduleMW_conf <- filter(IU_ScheduleMW, opponent %in% big_ten_teams)

After filtering the data, find the overall conference records for each coach in the two seasons of data available:

#Archie Miller conference record
AMconf_outcome_count <- count(IU_ScheduleAM_conf, outcome)
AMconf_outcome_count

## # A tibble: 2 × 2
##   outcome     n
##   <chr>   <int>
## 1 L          24
## 2 W          17

#Mike Woodson conference record
MWconf_outcome_count <- count(IU_ScheduleMW_conf, outcome)
MWconf_outcome_count

## # A tibble: 2 × 2
##   outcome     n
##   <chr>   <int>
## 1 L          20
## 2 W          23

Based on the counts above, Archie Miller had an overall record of 17-24 against conference opponents, for a win percentage of 41.5%. In contrast, Mike Woodson posted a record of 23-20, a 53.5% win percentage. The first two years under Mike Woodson were therefore more successful in terms of outcome than Archie Miller’s last two years.

The analyst then wanted to derive specific information on each coach’s record against each opponent. This was done using Tableau. The generated bar graphs are included below:

Observations from the visual for Coach Miller include:

In his last two seasons as head coach with the Hoosiers, there were a number of teams that Coach Miller had a losing record against. These teams included Illinois, Maryland, Michigan, Michigan State, Ohio State, Purdue, Rutgers, and Wisconsin. Among these were five teams that the Hoosiers could not seem to beat: Illinois, Michigan, Purdue, Rutgers, and Wisconsin.
In his last two seasons as head coach with the Hoosiers, Coach Miller had a winning record against Iowa, Minnesota, Nebraska, Northwestern, and Penn State. Among these were three teams that Miller was able to beat every time they competed: Iowa, Minnesota, and Nebraska.

Observations from the visual for Coach Woodson include:

In his first two seasons as head coach with the Hoosiers, Coach Woodson had losing records against the following teams: Iowa, Michigan State, Northwestern, Penn State, Rutgers, and Wisconsin. Among these were two teams that the Hoosiers could not seem to beat under Woodson: Iowa and Northwestern.
In his first two seasons as head coach with the Hoosiers, Coach Woodson had a winning record against Illinois, Maryland, Michigan, Minnesota, Nebraska, Ohio State, and Purdue. Among these were just two teams that Woodson was able to beat every time they competed: Minnesota and Nebraska.

After observing and analyzing the generated tables, it was time to compare the findings between the two coaches:

Miller had a losing record against 8 of the 13 conference teams, including 5 teams that the Hoosiers never beat in his last two seasons as head coach: Illinois, Michigan, Purdue, Rutgers, and Wisconsin. In comparison, Woodson posted a losing record against 6 of the 13 teams, including just two teams that the Hoosiers have yet to beat: Iowa and Northwestern. It is interesting to see that the teams the Hoosiers could not beat in each two-year span were different for the two coaches.
Conversely, Miller had a winning record against 5 of the 13 conference teams, including 3 teams that Miller was able to beat every time. Coach Woodson, on the other hand, had a winning record against 7 of the conference teams, but beat just 2 teams every time. Miller’s perfect-record teams were Iowa, Minnesota, and Nebraska, while Woodson’s perfect-record teams were Minnesota and Nebraska.

It appears that Woodson’s two years were more successful in conference play than Miller’s last two years. Although one team (Iowa) went from being beaten every time under Miller to beating the Hoosiers every time under Woodson, Woodson had a winning record against more conference teams than Miller did (7 versus 5). Once again, this points to improvement within the program.
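
The per-opponent records above were produced in Tableau. As a rough illustration, the same classification (perfect, winning, losing, winless) could be derived in R from the `opponent` and `outcome` columns. The sketch below uses hypothetical placeholder opponents and results, not the actual conference schedule, and all object names are illustrative.

```r
# Hypothetical conference results (placeholder teams, illustrative only)
conf_games <- data.frame(
  opponent = c("Team A", "Team A", "Team B", "Team B", "Team C", "Team C", "Team C"),
  outcome  = c("W", "W", "L", "L", "W", "L", "L")
)

# Wins and losses against each opponent
record <- table(conf_games$opponent, conf_games$outcome)
record_df <- data.frame(
  opponent = rownames(record),
  wins     = as.vector(record[, "W"]),
  losses   = as.vector(record[, "L"])
)

# Classify each opponent by the record against them
record_df$status <- ifelse(record_df$losses == 0, "perfect",
                    ifelse(record_df$wins == 0, "winless",
                    ifelse(record_df$wins > record_df$losses, "winning",
                    ifelse(record_df$wins < record_df$losses, "losing", "even"))))
record_df
```

Applied to `IU_ScheduleAM_conf` and `IU_ScheduleMW_conf`, a table like this would make it straightforward to count winning and losing records for each coach.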

Conference Score Differential Time Series

Similar to before, the analyst then looked to see how each coach performed over the course of the season within conference play. While all conference play is important, the end of the season is where conference titles and automatic tournament berths are decided. Tableau was used to create the following time series visuals of average score differential by month for in-conference games.

The visuals show that Coach Miller’s teams struggled in conference play no matter which part of the season the games occurred in. In particular, the beginning of conference play (in December) appeared to be the worst for Archie Miller. Although the team generally saw improvement throughout the course of the season, they could never seem to outscore their opponents, always having a negative score differential (the other team was scoring more points than their own). This was then compared to Mike Woodson’s teams:

In contrast, Mike Woodson’s teams have had relative success throughout the season in conference games. In fact, in three out of the four months of conference play the Hoosiers posted a positive point differential against their opponents, which was much different from the outcomes under Miller. However, there still appears to be a period of struggle under Woodson in February. Although this visual demonstrates that there are still shortcomings within the program, there is considerable evidence that the Hoosiers are improving their play under the current head coach.

This was then taken one step further to compare score differential by date in each of the seasons.

p1 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/Score Dif Conf 2019-20.png", scale = 1)
p2 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/Score Dif Conf 2020-21.png", scale = 1)
p3 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/Score Dif Conf 2021-22.png", scale = 1)
p4 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/Score Dif Conf 2022-23.png", scale = 1)

plot_grid(p1, p2, p3, p4, ncol = 2)

The following observations can be made:

The 2019-20 season had the worst start in score differential of any season. It is the only season whose trend line starts in the negatives, eventually breaking even at the beginning of February. It is also the only season with a positive trend line, though not necessarily the best season.
The 2020-21 season had the worst overall score differential: the trend line crosses zero at the beginning of January and becomes more negative as the season proceeds.
The 2021-22 and 2022-23 seasons both display downward trend lines; however, they do not turn negative until mid-February, which is near the end of the season.

Overall, the two graphs on the bottom display a better trend in score differential than the top two, suggesting that the Hoosiers have been improving over the past two years under Mike Woodson.

Conference Score Differential by Opponent

Now look at general scoring trends per opponent for each coach, starting with Archie Miller.

aggregate(cbind(team_score, opp_score, score_dif) ~ opponent, data = IU_ScheduleAM_conf, FUN = mean, na.rm = TRUE)

##          opponent team_score opp_score  score_dif
## 1        Illinois   65.66667  70.33333  -4.666667
## 2            Iowa   79.00000  70.33333   8.666667
## 3        Maryland   66.00000  69.00000  -3.000000
## 4        Michigan   61.00000  81.00000 -20.000000
## 5  Michigan State   65.33333  68.33333  -3.000000
## 6       Minnesota   74.00000  65.00000   9.000000
## 7        Nebraska   87.75000  76.00000  11.750000
## 8    Northwestern   70.66667  70.66667   0.000000
## 9      Ohio State   61.33333  66.66667  -5.333333
## 10     Penn State   68.00000  69.66667  -1.666667
## 11         Purdue   59.50000  69.75000 -10.250000
## 12        Rutgers   58.25000  67.00000  -8.750000
## 13      Wisconsin   64.33333  74.66667 -10.333333

aggregate(cbind(team_score, opp_score, score_dif) ~ opponent, data = IU_ScheduleMW_conf, FUN = mean, na.rm = TRUE)

##          opponent team_score opp_score score_dif
## 1        Illinois   68.25000  67.50000  0.750000
## 2            Iowa   77.00000  86.00000 -9.000000
## 3        Maryland   65.66667  61.66667  4.000000
## 4        Michigan   68.25000  70.75000 -2.500000
## 5  Michigan State   69.33333  75.00000 -5.666667
## 6       Minnesota   72.66667  65.33333  7.333333
## 7        Nebraska   75.66667  63.66667 12.000000
## 8    Northwestern   65.33333  69.00000 -3.666667
## 9      Ohio State   74.00000  67.00000  7.000000
## 10     Penn State   66.00000  67.66667 -1.666667
## 11         Purdue   73.25000  69.75000  3.500000
## 12        Rutgers   59.00000  63.00000 -4.000000
## 13      Wisconsin   63.66667  61.00000  2.666667

Observations between the two data tables include:

Archie Miller’s teams had a negative average score differential against 9 teams and a differential of 0 against 1. Mike Woodson’s teams had a negative average score differential against just 6 teams. Woodson’s lowest average was -9 against Iowa, compared to Miller’s -20 difference against Michigan.
Archie Miller’s teams had a positive average score differential against just 3 teams, all of which he posted a perfect record against. The largest differential was 11.75 against Nebraska. Mike Woodson posted a positive differential against 7 teams and, like Miller, posted his highest differential against Nebraska, at exactly 12.
The range of values for Woodson’s teams is much smaller than that for Miller’s teams. This could point to the team’s ability to be more consistent in its overall play. Woodson’s averages ranged from -9 to 12, while Miller’s averages ranged from -20 to 11.75.
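The counts and ranges quoted in these observations can be checked directly from the two aggregate tables. The sketch below retypes the `score_dif` column of each table by hand (rounded as printed); the helper name `summarise_dif` is made up for this illustration.

```r
# Average score differential by opponent, copied from the two tables above.
miller <- c(Illinois = -4.666667, Iowa = 8.666667, Maryland = -3,
            Michigan = -20, "Michigan State" = -3, Minnesota = 9,
            Nebraska = 11.75, Northwestern = 0, "Ohio State" = -5.333333,
            "Penn State" = -1.666667, Purdue = -10.25, Rutgers = -8.75,
            Wisconsin = -10.333333)
woodson <- c(Illinois = 0.75, Iowa = -9, Maryland = 4, Michigan = -2.5,
             "Michigan State" = -5.666667, Minnesota = 7.333333,
             Nebraska = 12, Northwestern = -3.666667, "Ohio State" = 7,
             "Penn State" = -1.666667, Purdue = 3.5, Rutgers = -4,
             Wisconsin = 2.666667)

# Hypothetical helper: count signs and measure the spread of the averages.
summarise_dif <- function(x) {
  c(negative = sum(x < 0), zero = sum(x == 0), positive = sum(x > 0),
    lowest = min(x), highest = max(x), spread = max(x) - min(x))
}

comparison <- rbind(Miller = summarise_dif(miller),
                    Woodson = summarise_dif(woodson))
print(comparison)
```

The spread column gives 31.75 points for Miller against 21 points for Woodson, which is the narrower range described above.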

The results suggest that under Woodson’s guidance the Hoosiers competed better against their in-conference opponents in terms of overall wins and losses as well as score differential. Overall, this is another good mark for the Hoosiers and their decision to hire Woodson. Other components of the game should still be evaluated.

Box Score Analysis

Important Variables for Game Outcome by Team

Important Variables for Game Outcome

Before any descriptive analysis was done on this table, it was necessary to find the variables that had the most influence on game outcome over the past four seasons. Since a box score has so many variables, it is necessary to limit the scope of the research. Otherwise, solid and well-informed conclusions may be hard to draw. With this in mind, several models were developed to ensure consistency in variable identification.

Manipulating the Data Before Model Creation

Filtering Box Score Data to Include Team Information

Since the analyst is interested in team trends and not individual players, the box score was filtered to only include the team’s box score lines for each game.

IU_TeamBS <- filter(IU_BoxScores, player == "TEAM")

After filtering to only include team information, the data was then filtered to only include conference opponent games. Since there was consistency in these opponents across both coaches, this acts as a control group.

IU_BS_conf <- filter(IU_TeamBS, opponent %in% c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State", "Minnesota", "Nebraska", "Northwestern", "Ohio State", "Penn State", "Purdue", "Rutgers", "Wisconsin"))

Now, some columns of the data will be deleted, as they contain recurring information that adds no value to the analysis.

IU_BS_conf <- IU_BS_conf[,-22]
IU_BS_conf <- IU_BS_conf[,-21]
#IU_BS_conf <- IU_BS_conf[,-20]
#IU_BS_conf <- IU_BS_conf[,-19]
IU_BS_conf <- IU_BS_conf[,-13]
IU_BS_conf <- IU_BS_conf[,-4]
IU_BS_conf <- IU_BS_conf[,-3]
IU_BS_conf <- IU_BS_conf[,-2]
IU_BS_conf <- IU_BS_conf[,-1]
IU_BS_conf

## # A tibble: 83 × 18
##      FGM   FGA ThreePTM ThreePTA   FTM   FTA  OREB  DREB   AST   STL   BLK    TO
##    <dbl> <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1    21    50        5       14    17    26    10    18    12     3     3    12
##  2    32    68        5       25    27    38    19    35    14     5     6    15
##  3    22    61        4       18    11    18    15    27     7     3     6    14
##  4    20    54        3       14    23    30    15    25    11     9     4    16
##  5    20    49        6       12    20    36     8    27    10    11     4    11
##  6    19    60        2       19    10    12    11    26     6     4     2    16
##  7    31    61        8       26    12    20    12    36    21     2     8    16
##  8    26    57        4       12    11    20    10    21    12     6     2     8
##  9    30    57        9       19     7    10     9    24    22     0     2     6
## 10    19    57        2       11     9    10    11    33     9     3     3    17
## # … with 73 more rows, and 6 more variables: PF <dbl>, PTS <dbl>, POSS <dbl>,
## #   opponent <chr>, location <chr>, outcome <dbl>
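The column drops above rely on numeric positions, which shift every time a column is removed (hence the highest-to-lowest order). A hedged alternative is to drop columns by name. The source does not show which names sit at those positions, so `game_id` and `player` in this toy data frame are placeholders, not the actual dropped columns.

```r
# Toy box score; `game_id` and `player` stand in for the unnamed dropped columns.
toy <- data.frame(
  game_id = 1:3,
  player  = "TEAM",
  FGM     = c(21, 32, 22),
  FGA     = c(50, 68, 61),
  stringsAsFactors = FALSE
)

# Dropping by name does not depend on column order or on earlier drops.
drop_cols <- c("game_id", "player")
toy_trimmed <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy_trimmed))
```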

Next, generate shooting percentages within the data set.

IU_BS_conf$FGP <- (IU_BS_conf$FGM / IU_BS_conf$FGA) * 100
IU_BS_conf$ThreePTP <- (IU_BS_conf$ThreePTM / IU_BS_conf$ThreePTA) * 100
IU_BS_conf$FTP <- (IU_BS_conf$FTM / IU_BS_conf$FTA) * 100
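As a worked example of these formulas, the sketch below applies them to the first row of the tibble printed earlier (FGM 21, FGA 50, ThreePTM 5, ThreePTA 14, FTM 17, FTA 26). The object name `first_game` is introduced only for this illustration.

```r
# First row of the printed IU_BS_conf tibble, retyped by hand.
first_game <- data.frame(FGM = 21, FGA = 50, ThreePTM = 5, ThreePTA = 14,
                         FTM = 17, FTA = 26)

# Same calculations as above: makes divided by attempts, as a percentage.
first_game$FGP      <- (first_game$FGM / first_game$FGA) * 100
first_game$ThreePTP <- (first_game$ThreePTM / first_game$ThreePTA) * 100
first_game$FTP      <- (first_game$FTM / first_game$FTA) * 100

pct <- round(first_game[, c("FGP", "ThreePTP", "FTP")], 2)
print(pct)
```

For that game the team shot 42% from the field, about 35.71% from three, and about 65.38% from the line.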

Now, delete the columns that were used to make the new columns so that collinearity does not become a factor for the models:

IU_BS_conf <- IU_BS_conf[,-(1:6)]

The last step is to create training and testing sets to fit and test the models. Since the models are focused less on predictive ability and more on variable importance, the data will be split into 90% training and 10% testing. Model accuracy will be used to help the analyst judge which model is best for ranking variable importance. The code to produce the training and testing sets is displayed below:

set.seed(1234) #used to provide consistency across the training and testing sets in successive runs

index <- sample(nrow(IU_BS_conf), nrow(IU_BS_conf)*0.90)
IU_BS_conf_train = IU_BS_conf[index,]
IU_BS_conf_test = IU_BS_conf[-index,]
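To make the resulting set sizes concrete: with 83 rows, 90% is 74.7 games. The 73 null degrees of freedom reported in the model summaries later in this section are consistent with 74 training rows, which leaves 9 games for testing. The sketch below reproduces those sizes on a stand-in data frame; `toy` and its `id` column are hypothetical, and `floor()` is used here to make the rounding explicit.

```r
set.seed(1234)

n_games <- 83                              # rows in IU_BS_conf per the tibble header
toy <- data.frame(id = seq_len(n_games))   # stand-in for the real box score rows

# Draw 90% of the row numbers (rounded down) without replacement.
index <- sample(n_games, floor(n_games * 0.90))
train <- toy[index, , drop = FALSE]
test  <- toy[-index, , drop = FALSE]

print(c(train = nrow(train), test = nrow(test)))
```

With only 9 testing games, each misclassified game moves the error rate by about 11 percentage points, so the accuracy comparisons are coarse.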

Logistic Regression

To find important variables in the box score data set in terms of game outcome, the analyst first created a logistic regression model. This is one of the simplest predictive models, but it can still be one of the most effective. The steps to arrive at a model in which every variable is statistically significant are included below:

## Train a logistic regression model with all variables
glm0 <- glm(outcome~ ., family = binomial, data = IU_BS_conf_train)

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# family = binomial is used because the outcome is a binary variable.
summary(glm0)

## 
## Call:
## glm(formula = outcome ~ ., family = binomial, data = IU_BS_conf_train)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.364e-05  -2.110e-08  -2.110e-08   2.110e-08   2.205e-05  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)             2.327e+02  1.612e+06   0.000    1.000
## OREB                   -4.519e+00  2.390e+04   0.000    1.000
## DREB                    1.883e+01  1.388e+04   0.001    0.999
## AST                    -1.236e+01  2.260e+04  -0.001    1.000
## STL                     2.206e+01  2.257e+04   0.001    0.999
## BLK                     3.374e+00  2.286e+04   0.000    1.000
## TO                      1.018e+01  2.905e+04   0.000    1.000
## PF                     -3.189e+00  1.570e+04   0.000    1.000
## PTS                     9.380e+00  2.761e+04   0.000    1.000
## POSS                   -1.717e+01  2.867e+04  -0.001    1.000
## opponentIowa            1.074e+01  1.976e+05   0.000    1.000
## opponentMaryland        2.303e+01  1.951e+05   0.000    1.000
## opponentMichigan        1.225e+02  1.704e+05   0.001    0.999
## opponentMichigan State  4.131e+01  1.721e+05   0.000    1.000
## opponentMinnesota      -4.345e+00  2.256e+05   0.000    1.000
## opponentNebraska       -3.472e+01  3.589e+05   0.000    1.000
## opponentNorthwestern   -3.852e+01  2.233e+05   0.000    1.000
## opponentOhio State     -4.011e+00  1.761e+05   0.000    1.000
## opponentPenn State      6.474e+01  2.933e+05   0.000    1.000
## opponentPurdue          1.026e+02  2.386e+05   0.000    1.000
## opponentRutgers        -7.173e+00  1.490e+05   0.000    1.000
## opponentWisconsin      -5.507e+01  1.340e+05   0.000    1.000
## locationH               7.684e+00  1.040e+05   0.000    1.000
## locationN               5.747e+01  1.306e+05   0.000    1.000
## FGP                    -4.005e-01  2.112e+04   0.000    1.000
## ThreePTP                2.855e-01  5.709e+03   0.000    1.000
## FTP                    -3.134e+00  7.109e+03   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.0253e+02  on 73  degrees of freedom
## Residual deviance: 3.7386e-09  on 47  degrees of freedom
## AIC: 54
## 
## Number of Fisher Scoring iterations: 25

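Start by removing opponent from the model, since it appears to be the source of the perfect separation indicated by the warnings above. The formula below also leaves out PTS and POSS. Note that this call and the ones that follow do not pass family = binomial, so glm() uses its default gaussian family; as the dispersion line in each summary confirms, these are linear probability models rather than true logistic regressions.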

glm1 <- glm(outcome ~ OREB +DREB + AST +STL +BLK +TO + PF + location + FGP + ThreePTP + FTP, data = IU_BS_conf_train)
summary(glm1)

## 
## Call:
## glm(formula = outcome ~ OREB + DREB + AST + STL + BLK + TO + 
##     PF + location + FGP + ThreePTP + FTP, data = IU_BS_conf_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.78686  -0.24335   0.02487   0.23441   0.79378  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.6993854  0.5492345  -4.915 7.02e-06 ***
## OREB         0.0189809  0.0152656   1.243 0.218488    
## DREB         0.0510768  0.0100081   5.104 3.51e-06 ***
## AST         -0.0343935  0.0152200  -2.260 0.027419 *  
## STL          0.0692396  0.0188813   3.667 0.000517 ***
## BLK          0.0294810  0.0220482   1.337 0.186149    
## TO          -0.0250526  0.0137841  -1.817 0.074053 .  
## PF          -0.0015966  0.0134431  -0.119 0.905850    
## locationH    0.1009491  0.1035476   0.975 0.333458    
## locationN    0.0991390  0.2069266   0.479 0.633579    
## FGP          0.0427634  0.0087269   4.900 7.40e-06 ***
## ThreePTP     0.0030286  0.0037434   0.809 0.421626    
## FTP         -0.0004662  0.0041730  -0.112 0.911418    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1314617)
## 
##     Null deviance: 18.4865  on 73  degrees of freedom
## Residual deviance:  8.0192  on 61  degrees of freedom
## AIC: 73.558
## 
## Number of Fisher Scoring iterations: 2
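The summary above reports a gaussian family, which is what glm() uses when no family argument is given. For comparison, here is a minimal sketch of how a formula of the same kind could be fit as a true logistic regression by passing `family = binomial`. The data are simulated and the coefficients used to generate them are arbitrary; only the column names mirror the report, and the results say nothing about the real games.

```r
set.seed(1)
n <- 200

# Simulated box score columns; names mirror the report, values are made up.
sim <- data.frame(
  DREB = rnorm(n, mean = 25, sd = 4),
  STL  = rnorm(n, mean = 5.5, sd = 2.5),
  FGP  = rnorm(n, mean = 43, sd = 6)
)

# Arbitrary "true" relationship used only to generate a binary outcome.
lin_pred <- -12 + 0.15 * sim$DREB + 0.20 * sim$STL + 0.16 * sim$FGP
sim$outcome <- rbinom(n, size = 1, prob = plogis(lin_pred))

# family = binomial requests a logit link; coefficients are on the log-odds scale.
fit_logit <- glm(outcome ~ DREB + STL + FGP, family = binomial, data = sim)
summary(fit_logit)

# Predicted win probabilities always lie strictly between 0 and 1.
prob <- predict(fit_logit, type = "response")
range(prob)
```

Unlike the gaussian fit, this model cannot produce fitted values below 0 or above 1, which matters when predictions are thresholded at 0.5.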

Now, the variable with the highest p-value will be removed, and the model refit, until all remaining variables fall under the 0.05 significance level. Start by removing FTP (p-value of 0.911418).

glm2 <- glm(outcome ~ OREB +DREB + AST +STL +BLK +TO + PF + location + FGP + ThreePTP, data = IU_BS_conf_train)
summary(glm2)

## 
## Call:
## glm(formula = outcome ~ OREB + DREB + AST + STL + BLK + TO + 
##     PF + location + FGP + ThreePTP, data = IU_BS_conf_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.78531  -0.23875   0.02996   0.23290   0.79034  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.728116   0.481418  -5.667 4.04e-07 ***
## OREB         0.019392   0.014697   1.319 0.191873    
## DREB         0.050960   0.009874   5.161 2.74e-06 ***
## AST         -0.034404   0.015098  -2.279 0.026142 *  
## STL          0.069152   0.018714   3.695 0.000467 ***
## BLK          0.029543   0.021865   1.351 0.181559    
## TO          -0.025298   0.013499  -1.874 0.065644 .  
## PF          -0.001735   0.013279  -0.131 0.896478    
## locationH    0.098588   0.100557   0.980 0.330690    
## locationN    0.098361   0.205156   0.479 0.633309    
## FGP          0.042805   0.008649   4.949 6.03e-06 ***
## ThreePTP     0.003024   0.003713   0.815 0.418475    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1293679)
## 
##     Null deviance: 18.4865  on 73  degrees of freedom
## Residual deviance:  8.0208  on 62  degrees of freedom
## AIC: 71.573
## 
## Number of Fisher Scoring iterations: 2

Now, remove PF from the model (p-value of 0.896478):

glm3 <- glm(outcome ~ OREB +DREB + AST +STL +BLK + TO + location + FGP + ThreePTP, data = IU_BS_conf_train)
summary(glm3)

## 
## Call:
## glm(formula = outcome ~ OREB + DREB + AST + STL + BLK + TO + 
##     location + FGP + ThreePTP, data = IU_BS_conf_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.79211  -0.23701   0.02454   0.23082   0.79408  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.754167   0.434750  -6.335 2.84e-08 ***
## OREB         0.019144   0.014460   1.324 0.190307    
## DREB         0.051163   0.009674   5.289 1.65e-06 ***
## AST         -0.034311   0.014963  -2.293 0.025196 *  
## STL          0.069125   0.018567   3.723 0.000422 ***
## BLK          0.029264   0.021590   1.355 0.180121    
## TO          -0.025485   0.013318  -1.914 0.060227 .  
## locationH    0.100935   0.098165   1.028 0.307778    
## locationN    0.099364   0.203406   0.489 0.626890    
## FGP          0.042669   0.008519   5.009 4.71e-06 ***
## ThreePTP     0.003046   0.003681   0.827 0.411094    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1273494)
## 
##     Null deviance: 18.486  on 73  degrees of freedom
## Residual deviance:  8.023  on 63  degrees of freedom
## AIC: 69.593
## 
## Number of Fisher Scoring iterations: 2

Then, remove the variable location (p-value of 0.626890).

glm4 <- glm(outcome ~ OREB +DREB + AST +STL +BLK + TO + FGP + ThreePTP, data = IU_BS_conf_train)
summary(glm4)

## 
## Call:
## glm(formula = outcome ~ OREB + DREB + AST + STL + BLK + TO + 
##     FGP + ThreePTP, data = IU_BS_conf_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7299  -0.2329   0.0369   0.2156   0.7434  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.839364   0.419927  -6.762 4.57e-09 ***
## OREB         0.021678   0.014016   1.547   0.1268    
## DREB         0.051360   0.009564   5.370 1.13e-06 ***
## AST         -0.035358   0.013767  -2.568   0.0125 *  
## STL          0.076635   0.017065   4.491 2.98e-05 ***
## BLK          0.029000   0.021440   1.353   0.1809    
## TO          -0.027730   0.013026  -2.129   0.0371 *  
## FGP          0.044986   0.008059   5.582 4.99e-07 ***
## ThreePTP     0.003198   0.003600   0.888   0.3777    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1256607)
## 
##     Null deviance: 18.4865  on 73  degrees of freedom
## Residual deviance:  8.1679  on 65  degrees of freedom
## AIC: 66.918
## 
## Number of Fisher Scoring iterations: 2

Now, remove the variable ThreePTP (p-value of 0.3777).

glm5 <- glm(outcome ~ OREB +DREB + AST +STL +BLK + TO + FGP, data = IU_BS_conf_train)
summary(glm5)

## 
## Call:
## glm(formula = outcome ~ OREB + DREB + AST + STL + BLK + TO + 
##     FGP, data = IU_BS_conf_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.68930  -0.23238   0.04607   0.19598   0.76797  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.821227   0.418759  -6.737 4.74e-09 ***
## OREB         0.021604   0.013993   1.544   0.1274    
## DREB         0.049572   0.009335   5.310 1.38e-06 ***
## AST         -0.033562   0.013596  -2.468   0.0162 *  
## STL          0.077769   0.016990   4.577 2.14e-05 ***
## BLK          0.031279   0.021252   1.472   0.1458    
## TO          -0.027660   0.013005  -2.127   0.0372 *  
## FGP          0.047194   0.007654   6.166 4.77e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1252591)
## 
##     Null deviance: 18.4865  on 73  degrees of freedom
## Residual deviance:  8.2671  on 66  degrees of freedom
## AIC: 65.811
## 
## Number of Fisher Scoring iterations: 2

Next, remove BLK from the model (p-value of 0.1458).

glm6 <- glm(outcome ~ OREB +DREB + AST +STL + TO + FGP, data = IU_BS_conf_train)
summary(glm6)

## 
## Call:
## glm(formula = outcome ~ OREB + DREB + AST + STL + TO + FGP, data = IU_BS_conf_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.68396  -0.28410   0.03347   0.22680   0.73557  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.854931   0.421756  -6.769 3.91e-09 ***
## OREB         0.021101   0.014110   1.495   0.1395    
## DREB         0.052555   0.009191   5.718 2.72e-07 ***
## AST         -0.030042   0.013500  -2.225   0.0294 *  
## STL          0.077911   0.017137   4.546 2.35e-05 ***
## TO          -0.026437   0.013091  -2.019   0.0474 *  
## FGP          0.047865   0.007706   6.211 3.80e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1274394)
## 
##     Null deviance: 18.4865  on 73  degrees of freedom
## Residual deviance:  8.5384  on 67  degrees of freedom
## AIC: 66.201
## 
## Number of Fisher Scoring iterations: 2

Then, remove OREB from the model (p-value of 0.1395).

glm7 <- glm(outcome ~ DREB + AST +STL + TO + FGP, data = IU_BS_conf_train)
summary(glm7)

## 
## Call:
## glm(formula = outcome ~ DREB + AST + STL + TO + FGP, data = IU_BS_conf_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.64272  -0.25725   0.01201   0.22957   0.68877  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.659992   0.404736  -6.572 8.29e-09 ***
## DREB         0.054565   0.009174   5.948 1.05e-07 ***
## AST         -0.025655   0.013297  -1.929   0.0579 .  
## STL          0.082628   0.016996   4.861 7.20e-06 ***
## TO          -0.019890   0.012449  -1.598   0.1147    
## FGP          0.043254   0.007127   6.069 6.43e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1297567)
## 
##     Null deviance: 18.4865  on 73  degrees of freedom
## Residual deviance:  8.8235  on 68  degrees of freedom
## AIC: 66.631
## 
## Number of Fisher Scoring iterations: 2

Now, remove TO from the model (p-value of 0.1147).

glm8 <- glm(outcome ~ DREB + AST +STL + FGP, data = IU_BS_conf_train)
summary(glm8)

## 
## Call:
## glm(formula = outcome ~ DREB + AST + STL + FGP, data = IU_BS_conf_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.71782  -0.25344   0.00512   0.25033   0.79047  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.789618   0.400958  -6.957 1.59e-09 ***
## DREB         0.053195   0.009236   5.759 2.15e-07 ***
## AST         -0.026212   0.013441  -1.950   0.0552 .  
## STL          0.082870   0.017186   4.822 8.17e-06 ***
## FGP          0.041966   0.007160   5.861 1.43e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1326765)
## 
##     Null deviance: 18.4865  on 73  degrees of freedom
## Residual deviance:  9.1547  on 69  degrees of freedom
## AIC: 67.358
## 
## Number of Fisher Scoring iterations: 2

Lastly, remove the variable AST (p-value of 0.0552).

glm9 <- glm(outcome ~ DREB +STL + FGP, data = IU_BS_conf_train)
summary(glm9)

## 
## Call:
## glm(formula = outcome ~ DREB + STL + FGP, data = IU_BS_conf_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.81631  -0.25910   0.03403   0.25014   0.83944  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.673482   0.404371  -6.611 6.33e-09 ***
## DREB         0.047562   0.008947   5.316 1.20e-06 ***
## STL          0.082876   0.017527   4.729 1.14e-05 ***
## FGP          0.034650   0.006219   5.571 4.39e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.137989)
## 
##     Null deviance: 18.4865  on 73  degrees of freedom
## Residual deviance:  9.6592  on 70  degrees of freedom
## AIC: 69.328
## 
## Number of Fisher Scoring iterations: 2

With AST removed, all remaining variables are statistically significant. The final model found the following variables to be significant, ranked by p-value:

FGP (most)
DREB
STL (least)
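The manual procedure above (refit, drop the term with the highest p-value, repeat) can be written as a loop. The following is a rough sketch under stated assumptions, not the code used for this report: `backward_eliminate` is a hypothetical helper, it handles numeric predictors only (factor terms such as location expand into several coefficients and would need extra handling), and it is demonstrated on simulated data.

```r
# Hypothetical helper: backward elimination on p-values for numeric predictors.
backward_eliminate <- function(data, response, alpha = 0.05,
                               family = gaussian()) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- glm(reformulate(predictors, response = response),
               family = family, data = data)
    coefs <- summary(fit)$coefficients
    terms_in <- setdiff(rownames(coefs), "(Intercept)")
    pvals <- setNames(as.vector(coefs[terms_in, 4]), terms_in)
    # Stop when every term is significant or only one predictor is left.
    if (max(pvals) < alpha || length(predictors) == 1) break
    # Otherwise drop the least significant term and refit.
    predictors <- setdiff(predictors, names(which.max(pvals)))
  }
  fit
}

# Simulated data: y depends on x1 and x2; noise1 and noise2 are unrelated.
set.seed(42)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
sim$y <- 2 * sim$x1 - 2 * sim$x2 + rnorm(n)

final_fit <- backward_eliminate(sim, "y")
kept <- setdiff(names(coef(final_fit)), "(Intercept)")
print(kept)
```

The default family matches the gaussian fits shown above; passing `family = binomial()` would apply the same loop to a logistic model.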

Now, to be able to compare this model to the other models, use the testing set to evaluate how well the model predicts game outcome and find the misclassification rate.

pred_glm9_test <- predict(glm9, newdata = IU_BS_conf_test, type = "response")
table(IU_BS_conf_test$outcome, (pred_glm9_test > 0.5)*1, dnn = c("Truth", "Predicted"))

##      Predicted
## Truth 0 1
##     0 4 2
##     1 0 3

The misclassification rate of the logistic regression model is 2/9 or 22.2%.
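The rate can be computed from the confusion matrix as one minus the share of games on the diagonal. The sketch below retypes the matrix printed above; the object names are chosen only for this illustration.

```r
# Confusion matrix from the output above (rows = truth, columns = predicted).
# matrix() fills column by column: predicted-0 column first, then predicted-1.
conf_mat <- matrix(c(4, 0, 2, 3), nrow = 2,
                   dimnames = list(Truth = c("0", "1"),
                                   Predicted = c("0", "1")))

# Correct predictions sit on the diagonal; everything else is misclassified.
misclass_rate <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
print(round(misclass_rate, 3))
```

Both errors are losses that the model predicted as wins (truth 0, predicted 1).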

Classification Trees

After obtaining results from the logistic regression, the analyst then generated a classification tree to assess variable importance. Note that the call below fits the tree on the full data set (IU_BS_conf, n = 83) rather than on the training set. The code is displayed below:

rpart0 <- rpart(formula = outcome ~ ., data = IU_BS_conf, method = "class")
rpart0

## n= 83 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 83 39 0 (0.53012048 0.46987952)  
##    2) PTS< 65.5 33  6 0 (0.81818182 0.18181818)  
##      4) DREB< 29 26  2 0 (0.92307692 0.07692308) *
##      5) DREB>=29 7  3 1 (0.42857143 0.57142857) *
##    3) PTS>=65.5 50 17 1 (0.34000000 0.66000000)  
##      6) opponent=Illinois,Iowa,Maryland,Michigan State,Northwestern,Purdue,Rutgers,Wisconsin 29 14 0 (0.51724138 0.48275862)  
##       12) PF>=20.5 8  1 0 (0.87500000 0.12500000) *
##       13) PF< 20.5 21  8 1 (0.38095238 0.61904762)  
##         26) DREB< 24.5 10  4 0 (0.60000000 0.40000000) *
##         27) DREB>=24.5 11  2 1 (0.18181818 0.81818182) *
##      7) opponent=Michigan,Minnesota,Nebraska,Ohio State,Penn State 21  2 1 (0.09523810 0.90476190) *

prp(rpart0, digits = 4, extra = 1)

When the classification tree was generated, the following variables were denoted as important in relation to outcome:

PTS (most)
DREB
opponent
PF (least)

This model has one variable in common with the logistic regression model (DREB). The analyst then compared model accuracy by generating the misclassification rate for the classification tree.

pred0 <- predict(rpart0, IU_BS_conf_test, type = "class")

#table representing the number of predictions matched correctly with the testing set
table(IU_BS_conf_test$outcome, pred0, dnn = c("True", "Pred"))

##     Pred
## True 0 1
##    0 6 0
##    1 0 3

The model predicted every observation in the testing set correctly (9/9). This is less impressive than it looks: the tree was fit on all 83 games, including the 9 testing games, so this is not an out-of-sample result. Since the analyst is only looking for variable importance rather than predictive ability, this will be set aside.
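The rpart() call above passes `data = IU_BS_conf`, so the testing games were part of the fit. A hedged sketch of the alternative, fitting on the training rows only and scoring the held-out rows, is shown below. The data are simulated (the PTS and DREB names are borrowed from the report, the values and the outcome rule are made up), so the resulting error rate is not a result about the Hoosiers.

```r
library(rpart)

set.seed(1234)
n <- 83

# Simulated stand-in for the box score data; values are made up.
sim <- data.frame(
  PTS  = round(rnorm(n, mean = 68, sd = 9)),
  DREB = round(rnorm(n, mean = 25, sd = 4))
)
sim$outcome <- as.integer(sim$PTS + rnorm(n, mean = 0, sd = 8) > 68)

# 90/10 split, mirroring the split used in this section.
index <- sample(n, floor(n * 0.90))
train <- sim[index, ]
test  <- sim[-index, ]

# Fit on the training rows only, so the testing rows stay unseen.
tree_train <- rpart(outcome ~ ., data = train, method = "class")
pred <- predict(tree_train, test, type = "class")

print(table(test$outcome, pred, dnn = c("True", "Pred")))
holdout_error <- mean(as.character(pred) != as.character(test$outcome))
print(holdout_error)
```

The same pattern with `data = IU_BS_conf_train` would give a misclassification rate comparable to those of the other two models.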

Random Forest

After completing the classification tree, the analyst then looked to find variable significance using random forests.

First, change any categorical variables into factors.

IU_BS_conf_train$outcome <- as.factor(IU_BS_conf_train$outcome)
IU_BS_conf_test$outcome <- as.factor(IU_BS_conf_test$outcome)
IU_BS_conf_train$opponent <- as.factor(IU_BS_conf_train$opponent)
IU_BS_conf_test$opponent <- as.factor(IU_BS_conf_test$opponent)
IU_BS_conf_train$location <- as.factor(IU_BS_conf_train$location)
IU_BS_conf_test$location <- as.factor(IU_BS_conf_test$location)

Now, the random forest model can be generated.

rf0 <- randomForest(outcome~., data = IU_BS_conf_train, importance = TRUE)
rf0

## 
## Call:
##  randomForest(formula = outcome ~ ., data = IU_BS_conf_train,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 32.43%
## Confusion matrix:
##    0  1 class.error
## 0 21 17   0.4473684
## 1  7 29   0.1944444

After the model has been created, variable importance can be found using the following code. The results rank the variables much as the regression model did earlier. Note that the variables with the highest MeanDecreaseAccuracy are the most important.

rf0$importance

##                      0             1 MeanDecreaseAccuracy MeanDecreaseGini
## OREB     -0.0045716768 -0.0031017532        -0.0034256084         1.330370
## DREB      0.0198105896  0.0263716827         0.0233921829         4.014946
## AST       0.0059317373  0.0032623310         0.0045329569         1.819215
## STL       0.0061118361  0.0101952693         0.0084751543         2.518112
## BLK      -0.0003306202  0.0008541847        -0.0003431658         1.644625
## TO        0.0013874397 -0.0031745179        -0.0009305782         1.416572
## PF       -0.0038763941 -0.0013647694        -0.0023461074         1.368019
## PTS       0.0351203506  0.0316189009         0.0324080885         4.650948
## POSS      0.0007125838  0.0054537647         0.0024841964         1.878996
## opponent  0.0067425759  0.0165795212         0.0112263215         6.988677
## location  0.0002209774  0.0134348753         0.0061603603         1.484484
## FGP       0.0248471299  0.0193728572         0.0216079159         3.553630
## ThreePTP -0.0020214087  0.0068091265         0.0030250211         2.212006
## FTP      -0.0051409723 -0.0020567914        -0.0039221743         1.609345

The variables found to be most important in relation to game outcome for the random forest model are listed below:

PTS (most)
DREB
FGP
opponent
STL
location (least)
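This ranking can be reproduced by sorting the MeanDecreaseAccuracy column in descending order. The sketch below retypes that column from the importance table above; `mda` and `ranked` are names chosen for this illustration.

```r
# MeanDecreaseAccuracy values copied from the rf0$importance output above.
mda <- c(OREB = -0.0034256084, DREB = 0.0233921829, AST = 0.0045329569,
         STL = 0.0084751543, BLK = -0.0003431658, TO = -0.0009305782,
         PF = -0.0023461074, PTS = 0.0324080885, POSS = 0.0024841964,
         opponent = 0.0112263215, location = 0.0061603603,
         FGP = 0.0216079159, ThreePTP = 0.0030250211, FTP = -0.0039221743)

# Largest decrease in accuracy first: these variables matter most.
ranked <- sort(mda, decreasing = TRUE)
print(head(ranked, 6))
```

On a fitted model the equivalent would be to order the rows of `rf0$importance` by that column. The MeanDecreaseGini column in the same table gives a somewhat different ordering, with opponent highest at 6.988677.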

There are once again recurring variables, as well as new variables identified within this model. Now, the analyst will use the testing set to find the misclassification rate.

# a trick to ensure that the factor levels of the training and testing sets match, avoiding errors in predict()
IU_BS_conf_test <- rbind(IU_BS_conf_train[1, ], IU_BS_conf_test)
IU_BS_conf_test <- IU_BS_conf_test[-1, ]

rf_pred <- predict(rf0, IU_BS_conf_test)

#shows how our predictions match up against the actual values.
table(IU_BS_conf_test$outcome, rf_pred, dnn = c("True", "Pred"))

##     Pred
## True 0 1
##    0 5 1
##    1 0 3

This model has a misclassification rate of 1/9, or about 11.1%. This is better than the logistic regression model but worse than the classification tree.
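The rbind() step used before prediction aligns the factor levels of the testing set with those of the training set. An alternative sketch, shown here on toy data with hypothetical rows, is to rebuild each factor in the testing set using the training levels explicitly.

```r
# Toy training and testing sets; the rows are hypothetical.
train <- data.frame(opponent = factor(c("Iowa", "Purdue", "Rutgers", "Iowa")))
test  <- data.frame(opponent = c("Purdue", "Iowa"), stringsAsFactors = FALSE)

# Re-create the testing factor with exactly the training levels.
test$opponent <- factor(test$opponent, levels = levels(train$opponent))

print(levels(test$opponent))
```

One design consequence is that an opponent appearing only in the testing set would become NA under this approach, which surfaces the mismatch instead of hiding it.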

Important Variable Conclusion

The following variable was found to be significant for game outcome in all models: DREB. While it was the only variable to be consistent in all three models, there were a number of variables that were found to be important in at least two models. These include: FGP, STL, PTS, and opponent. There were two variables that were found to be significant in just one model out of the three. These include: PF and location.

This is not to say that the analyst is limited to the variables in the list above; however, the general focus will be on these variables as comparisons across coaches are made.

In-Conference Team Averages

Manipulating the Data for Coaches Averages

First, the analyst needed to format the data so that it presented box score information for each coach’s tenure.

IU_BS_conf <- filter(IU_TeamBS, opponent %in% c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State", "Minnesota", "Nebraska", "Northwestern", "Ohio State", "Penn State", "Purdue", "Rutgers", "Wisconsin"))

Remake the shooting percentages.

IU_BS_conf$FGP <- (IU_BS_conf$FGM / IU_BS_conf$FGA) * 100
IU_BS_conf$ThreePTP <- (IU_BS_conf$ThreePTM / IU_BS_conf$ThreePTA) * 100
IU_BS_conf$FTP <- (IU_BS_conf$FTM / IU_BS_conf$FTA) * 100

Now, remove the unnecessary columns from the data set.

IU_BS_conf <- IU_BS_conf[,-22]
IU_BS_conf <- IU_BS_conf[,-21]
IU_BS_conf <- IU_BS_conf[,-20]
IU_BS_conf <- IU_BS_conf[,-(2:10)]

This data was then split to match the seasons each coach led the team: 2019-20 and 2020-21 for Archie Miller (labeled 2020 and 2021 in the season column) and 2021-22 and 2022-23 for Mike Woodson (labeled 2022 and 2023).

IU_BS_confAM <- filter(IU_BS_conf, season == "2020" | season == "2021")
IU_BS_confMW <- filter(IU_BS_conf, season == "2022" | season == "2023")
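Instead of keeping two separate data frames, the same split could be expressed as a coach label on each row, which makes side-by-side averages a single call. This is a sketch on toy data: the `coach` column is an addition for illustration, and the PTS values are hypothetical.

```r
# Toy rows; seasons follow the report's labels, PTS values are made up.
toy <- data.frame(
  season = c(2020, 2021, 2022, 2023, 2020, 2023),
  PTS    = c(66, 68, 75, 74, 67, 76)
)

# Seasons 2020 and 2021 belong to Miller, 2022 and 2023 to Woodson.
toy$coach <- ifelse(toy$season %in% c(2020, 2021), "Miller", "Woodson")

# One aggregate call then compares both coaches at once.
coach_means <- aggregate(PTS ~ coach, data = toy, FUN = mean)
print(coach_means)
```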

Archie Miller’s Conference Averages

Now, each of these was summarized for comparison. First, over the two-season span for Archie Miller.

summary(IU_BS_confAM)

##      season          OREB           DREB            REB             AST       
##  Min.   :2020   Min.   : 2.0   Min.   :14.00   Min.   :21.00   Min.   : 6.00  
##  1st Qu.:2020   1st Qu.: 8.0   1st Qu.:21.00   1st Qu.:28.75   1st Qu.: 9.75  
##  Median :2020   Median :10.0   Median :25.00   Median :35.00   Median :13.00  
##  Mean   :2020   Mean   :10.1   Mean   :24.85   Mean   :34.95   Mean   :12.53  
##  3rd Qu.:2021   3rd Qu.:12.0   3rd Qu.:27.00   3rd Qu.:39.00   3rd Qu.:15.00  
##  Max.   :2021   Max.   :19.0   Max.   :36.00   Max.   :54.00   Max.   :22.00  
##       STL              BLK              TO              PF       
##  Min.   : 0.000   Min.   :0.000   Min.   : 6.00   Min.   :10.00  
##  1st Qu.: 3.750   1st Qu.:2.000   1st Qu.: 9.00   1st Qu.:14.00  
##  Median : 6.000   Median :3.000   Median :12.00   Median :17.50  
##  Mean   : 5.425   Mean   :3.475   Mean   :11.85   Mean   :17.48  
##  3rd Qu.: 7.000   3rd Qu.:5.000   3rd Qu.:14.25   3rd Qu.:19.25  
##  Max.   :11.000   Max.   :8.000   Max.   :17.00   Max.   :28.00  
##       PTS          opponent           location            outcome   
##  Min.   :49.00   Length:40          Length:40          Min.   :0.0  
##  1st Qu.:59.00   Class :character   Class :character   1st Qu.:0.0  
##  Median :66.50   Mode  :character   Mode  :character   Median :0.0  
##  Mean   :67.45                                         Mean   :0.4  
##  3rd Qu.:72.25                                         3rd Qu.:1.0  
##  Max.   :96.00                                         Max.   :1.0  
##       FGP           ThreePTP          FTP       
##  Min.   :25.42   Min.   :10.00   Min.   :40.00  
##  1st Qu.:37.23   1st Qu.:21.33   1st Qu.:60.83  
##  Median :42.30   Median :33.33   Median :68.20  
##  Mean   :42.18   Mean   :33.01   Mean   :67.58  
##  3rd Qu.:45.96   3rd Qu.:41.11   3rd Qu.:76.52  
##  Max.   :57.78   Max.   :62.50   Max.   :90.00

Next, the information was derived by each season Archie Miller was head coach.

aggregate(cbind(OREB, DREB, AST, STL, BLK, TO, PF, PTS, FGP, ThreePTP, FTP) ~ season, data = IU_BS_confAM, FUN = mean, na.rm = TRUE)

##   season  OREB DREB   AST  STL  BLK   TO    PF   PTS      FGP ThreePTP      FTP
## 1   2020 11.05 25.3 11.75 5.10 3.80 12.2 17.15 66.45 41.78453 32.68554 68.42506
## 2   2021  9.15 24.4 13.30 5.75 3.15 11.5 17.80 68.45 42.57464 33.34246 66.74194

These values will be listed in an organized set of tables under the conclusion tab.

Mike Woodson’s Conference Averages

The same was then done for Mike Woodson, starting with the overall averages across both seasons.

summary(IU_BS_confMW)

##      season          OREB             DREB            REB       
##  Min.   :2022   Min.   : 2.000   Min.   :14.00   Min.   :20.00  
##  1st Qu.:2022   1st Qu.: 7.000   1st Qu.:22.50   1st Qu.:31.00  
##  Median :2022   Median : 9.000   Median :25.00   Median :34.00  
##  Mean   :2022   Mean   : 8.791   Mean   :25.44   Mean   :34.23  
##  3rd Qu.:2023   3rd Qu.:10.500   3rd Qu.:29.00   3rd Qu.:39.00  
##  Max.   :2023   Max.   :15.000   Max.   :35.00   Max.   :45.00  
##       AST             STL              BLK               TO       
##  Min.   : 6.00   Min.   : 1.000   Min.   : 1.000   Min.   : 3.00  
##  1st Qu.:11.00   1st Qu.: 4.000   1st Qu.: 3.000   1st Qu.: 9.00  
##  Median :14.00   Median : 5.000   Median : 4.000   Median :10.00  
##  Mean   :13.91   Mean   : 5.302   Mean   : 4.442   Mean   :10.81  
##  3rd Qu.:16.00   3rd Qu.: 7.000   3rd Qu.: 6.000   3rd Qu.:13.00  
##  Max.   :22.00   Max.   :11.000   Max.   :10.000   Max.   :23.00  
##        PF            PTS          opponent           location        
##  Min.   :10.0   Min.   :48.00   Length:43          Length:43         
##  1st Qu.:15.0   1st Qu.:62.00   Class :character   Class :character  
##  Median :18.0   Median :68.00   Mode  :character   Mode  :character  
##  Mean   :17.3   Mean   :68.81                                        
##  3rd Qu.:20.0   3rd Qu.:74.50                                        
##  Max.   :25.0   Max.   :89.00                                        
##     outcome            FGP           ThreePTP          FTP        
##  Min.   :0.0000   Min.   :30.36   Min.   :12.50   Min.   : 46.15  
##  1st Qu.:0.0000   1st Qu.:40.98   1st Qu.:26.79   1st Qu.: 66.67  
##  Median :1.0000   Median :45.16   Median :31.58   Median : 71.43  
##  Mean   :0.5349   Mean   :45.33   Mean   :34.62   Mean   : 71.98  
##  3rd Qu.:1.0000   3rd Qu.:49.44   3rd Qu.:40.83   3rd Qu.: 78.17  
##  Max.   :1.0000   Max.   :61.82   Max.   :76.92   Max.   :100.00

Then, the aggregation by season was done to find season averages.

aggregate(cbind(OREB, DREB, AST, STL, BLK, TO, PF, PTS, FGP, ThreePTP, FTP) ~ season, data = IU_BS_confMW, FUN = mean, na.rm = TRUE)

##   season     OREB     DREB      AST      STL      BLK       TO       PF
## 1   2022 8.695652 25.34783 14.13043 5.652174 4.434783 10.04348 17.13043
## 2   2023 8.900000 25.55000 13.65000 4.900000 4.450000 11.70000 17.50000
##        PTS      FGP ThreePTP      FTP
## 1 67.82609 43.97354 32.85037 71.89765
## 2 69.95000 46.89440 36.66028 72.07123

These values will be listed in an organized set of tables under the conclusion tab.

Conference Average Conclusions

This information was then all stored in a table for easy comparison between the two coaches and their respective season averages.

First, look at the variables identified as important to game outcome and compare these values:

DREB- Mike Woodson’s two-year average (25.44) is better than Archie Miller’s (24.85), but the difference is small. In fact, both coaches averaged about 25.3 per game in the first season of their respective spans. The difference comes in the following season: Miller’s defensive rebound average decreased (to 24.4) while Woodson’s increased (to 25.55).
FGP- Once again, Mike Woodson’s two-year average is better than Archie Miller’s, by more than 3 percentage points (45.33% versus 42.18%). In fact, each of Woodson’s two seasons posted a higher field goal percentage than either of Miller’s. Further, Woodson’s teams improved from one season to the next, going from 44.0% in 2021-22 to 46.9% in 2022-23.
STL- Archie Miller’s teams posted a slightly better steals average than Mike Woodson’s (5.43 versus 5.30). Miller’s teams increased their average steals over his two-year span (5.10 to 5.75), whereas Woodson’s teams saw a decrease in his second year (5.65 to 4.90).
PTS- Mike Woodson’s teams have a higher average points scored than Archie Miller’s (68.81 versus 67.45). Because this was already explored in the schedule analysis portion, the finding is not new. Both coaches’ averages rose by about 2 points from the first season to the second; Woodson’s teams simply score slightly more.
PF- Personal fouls are about the same for each coach (17.48 for Miller and 17.30 for Woodson) and across seasons, so there is no meaningful difference to compare.

Other notable observations are:

FTP- Mike Woodson’s teams’ free throw percentages are well above the averages of Archie Miller’s teams: 71.98% versus 67.58%, a difference of about 4.4 percentage points. Woodson’s teams also saw a slight increase over his two seasons as head coach (71.90% to 72.07%).
ThreePTP- While the difference in average three-point percentage between the coaches is not remarkable (34.62% versus 33.01%), Woodson still holds the better average. The observation worth noting is the improvement from year to year. Both coaches improved their three-point percentages from one season to the next, but Woodson’s improvement of nearly 4 percentage points (32.85% to 36.66%) is more impressive than Miller’s increase of less than 1 point (32.69% to 33.34%).
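The comparisons in the two lists above come down to simple differences between the coaches’ two-season means. As a worked check, the sketch below copies those means from the `summary()` output earlier in this section and computes Woodson minus Miller for each statistic; the `coach_means` data frame is an illustrative construct, not an object from the original analysis.

```r
# Two-season conference means copied from the summary() output above.
coach_means <- data.frame(
  stat    = c("DREB", "FGP", "STL", "PTS", "PF", "FTP", "ThreePTP"),
  Miller  = c(24.85, 42.18, 5.425, 67.45, 17.48, 67.58, 33.01),
  Woodson = c(25.44, 45.33, 5.302, 68.81, 17.30, 71.98, 34.62)
)

# Positive values mean Woodson's average is higher than Miller's.
coach_means$diff <- round(coach_means$Woodson - coach_means$Miller, 2)
coach_means
```

The percentage rows (FGP, FTP, ThreePTP) are differences in percentage points, not relative changes.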

In-Conference Team Performance by Month

The analyst set out to dissect the data by date to see whether each coach showed general trends over the course of the season and throughout conference play, using the variables identified as important in the earlier analysis. Each of these variables was examined as a monthly average, with November denoting the first month of play. The visuals were all created in Tableau, and screenshots are included.
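The monthly averages in this section were produced in Tableau. For readers who prefer to stay in R, the following is a minimal sketch of the same idea, not the analyst’s actual workflow. It assumes a game-date column, here called `date`, which is a hypothetical name that does not appear in the box score data shown in this report, and it uses made-up values.

```r
# Toy data for illustration only: `date` is a hypothetical column name.
toy <- data.frame(
  date = as.Date(c("2021-12-05", "2021-12-20", "2022-01-10", "2022-02-15")),
  DREB = c(24, 28, 25, 22)
)

# Month name from the date, ordered by season month rather than alphabetically.
season_months <- c("November", "December", "January", "February", "March")
toy$month <- factor(month.name[as.integer(format(toy$date, "%m"))],
                    levels = season_months)

# Average defensive rebounds per month of the season.
monthly <- aggregate(DREB ~ month, data = toy, FUN = mean)
monthly
```

Ordering the factor levels by season month keeps December ahead of January in the output, which matches how the monthly visuals are read.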

Defensive Rebounds (DREB)

First, start with Coach Archie Miller:

Over the last two seasons that Archie Miller was head coach, the following was observed:

Under Miller, the Hoosiers appeared to have a consistent number of defensive rebounds in conference play regardless of the part of the season. Miller’s teams hovered just under 25 defensive rebounds per game in each month.
The general trend for the two seasons in this data set for Archie Miller shows an increase in the number of defensive rebounds as the season went on.

Now, look at Coach Mike Woodson:

Over the last two seasons that Mike Woodson has been head coach, the following was observed:

Under Woodson, the Hoosiers appeared to have a decreasing number of defensive rebounds in conference play as the season progresses.
Woodson’s teams typically stay above the 25 defensive rebound mark. February is the only month that appears to fall below it.
Between February and March there was an increase in the number of defensive rebounds; however, the March average was still lower than the averages for December and January.

In short, Coach Woodson appears to have a higher defensive rebound average throughout the season than Coach Miller. However, the trends of the two coaches are opposite. The visuals show that Miller’s teams (starting with a lower average) generally increase their number of defensive rebounds, while Woodson’s teams (starting with a higher average) generally decrease their number of defensive rebounds throughout conference play. Lastly, the two coaches have around the same minimum monthly average for defensive rebounds, but Woodson has a higher maximum monthly average.

Field Goal Percentage (FGP)

First, start with Coach Archie Miller:

Over the last two seasons that Archie Miller was head coach, the following was observed:

The start of conference play was the best for his teams, posting their highest average monthly field goal percentage at about 47% in December.
Under Miller, the Hoosiers struggled deeper into conference play, posting their lowest average monthly field goal percentage at just under 40%.
The general trend for the two seasons in this data set for Archie Miller is a decrease in shooting effectiveness as they get deeper into conference play.

Now, look at Coach Mike Woodson:

Over the last two seasons that Mike Woodson has been head coach, the following was observed:

The start of conference play has been the worst for his teams with December holding a field goal percentage average of just over 40%.
Over the past two seasons, January has been their best shooting month, with a field goal percentage around 47%.
Under Woodson, the general trend for the past two seasons has been an increase in shooting effectiveness as they get deeper into conference play. While March does not post the highest field goal percentage, it is still higher than both December and February.

In short, while both coaches led the team to around the same best and worst average field goal percentages by month, the timing of each is different. Miller’s teams in his last two seasons saw an overall decrease in field goal percentage over the course of the season, while Woodson’s teams generally saw an increase.

Steals (STL)

First, start with Coach Archie Miller:

Over the last two seasons that Archie Miller was head coach, the following was observed:

Average steals per month typically stayed in the 5 to 6 range.
The lowest monthly steal average occurred in January, while the highest monthly steal average occurred in February.
The general trend for the two seasons in this data set for Archie Miller is an increase in steals as they get deeper into conference play.

Now, look at Coach Mike Woodson:

Over the last two seasons that Mike Woodson has been head coach, the following was observed:

Average steals by month varied more widely. Under Woodson, the values ranged from just under 5 to just under 7.
The lowest monthly steal average occurred in January, while the highest monthly steal average occurred in March.
The general trend for the two seasons in this data set for Mike Woodson is an increase in steals as they get deeper into conference play.

In short, while both coaches increase the number of steals as conference play continues, Miller’s teams in his last two seasons seem to be more consistent in this area of play. However, Mike Woodson’s teams in the month of March by far outdo any of the monthly averages that Miller’s teams posted.

Team Averages Against In-Conference Opponents

After looking at general trends over the course of conference play, the analyst then derived opponent-specific information. As before, averages were derived per coach and per season.

Archie Miller’s Conference Opponent Averages

Each of these subsets was then aggregated by opponent for comparison, starting with the two-season span for Archie Miller.

AM_conf_avg <- aggregate(cbind(OREB, DREB, AST, STL, TO, FGP, ThreePTP) ~ opponent, data = IU_BS_confAM, FUN = mean, na.rm = TRUE)
AM_conf_avg

##          opponent      OREB     DREB      AST      STL       TO      FGP
## 1        Illinois  8.666667 26.33333 11.00000 5.333333 11.00000 40.35796
## 2            Iowa 13.333333 25.00000 15.33333 8.333333 11.33333 43.63191
## 3        Maryland 12.000000 27.33333 13.00000 2.333333 10.00000 41.99510
## 4        Michigan  7.000000 17.00000  7.50000 2.500000  9.00000 42.18159
## 5  Michigan State 10.000000 22.33333 11.66667 7.333333  9.00000 40.73365
## 6       Minnesota  9.000000 27.00000 14.66667 4.333333 12.33333 51.02323
## 7        Nebraska 14.333333 33.33333 16.33333 3.666667 13.00000 48.01078
## 8    Northwestern 13.000000 26.00000 13.00000 8.333333 15.00000 40.19928
## 9      Ohio State  7.666667 21.33333 11.00000 6.000000 13.00000 42.68481
## 10     Penn State  8.000000 27.00000 10.33333 6.000000 14.00000 44.35626
## 11         Purdue  9.250000 20.50000 12.00000 5.750000 12.00000 37.72054
## 12        Rutgers 10.250000 24.25000 12.25000 5.750000 12.50000 37.02235
## 13      Wisconsin  8.000000 24.66667 13.33333 3.666667 10.66667 41.62329
##    ThreePTP
## 1  46.29630
## 2  41.84224
## 3  29.25749
## 4  25.83333
## 5  21.46199
## 6  35.63492
## 7  29.42308
## 8  31.21693
## 9  47.22222
## 10 28.49168
## 11 23.14312
## 12 32.49269
## 13 37.93651

Next, the information was derived by each season Archie Miller was head coach.

AM_conf_avg_by_season <- aggregate(cbind(OREB, DREB, AST, STL, TO, FGP, ThreePTP) ~ opponent + season, data = IU_BS_confAM, FUN = mean, na.rm = TRUE)
AM_conf_avg_by_season

##          opponent season OREB     DREB      AST       STL       TO      FGP
## 1        Illinois   2020 12.0 27.00000 12.00000  4.000000 10.00000 40.67797
## 2            Iowa   2020 16.0 23.00000 16.00000 11.000000 17.00000 45.90164
## 3        Maryland   2020 12.0 25.50000 14.50000  1.500000 10.00000 44.34858
## 4        Michigan   2020  7.0 14.00000  7.00000  1.000000  7.00000 45.90164
## 5  Michigan State   2020 10.0 21.00000 12.00000  6.000000  8.00000 45.61404
## 6       Minnesota   2020  8.5 28.50000 14.50000  5.500000 10.00000 47.64595
## 7        Nebraska   2020 15.5 35.50000 17.50000  3.500000 15.50000 48.93925
## 8    Northwestern   2020 15.0 25.00000 11.00000  9.000000 16.00000 37.03704
## 9      Ohio State   2020  6.0 23.50000 10.00000  7.500000 12.00000 43.02721
## 10     Penn State   2020 10.0 29.50000  8.00000  6.500000 14.50000 37.96296
## 11         Purdue   2020 12.0 20.50000 10.50000  4.500000 13.50000 34.28049
## 12        Rutgers   2020 11.0 26.00000  6.00000  4.000000 16.00000 31.66667
## 13      Wisconsin   2020 11.0 22.00000 10.50000  4.500000  9.50000 38.24138
## 14       Illinois   2021  7.0 26.00000 10.50000  6.000000 11.50000 40.19796
## 15           Iowa   2021 12.0 26.00000 15.00000  7.000000  8.50000 42.49705
## 16       Maryland   2021 12.0 31.00000 10.00000  4.000000 10.00000 37.28814
## 17       Michigan   2021  7.0 20.00000  8.00000  4.000000 11.00000 38.46154
## 18 Michigan State   2021 10.0 23.00000 11.50000  8.000000  9.50000 38.29346
## 19      Minnesota   2021 10.0 24.00000 15.00000  2.000000 17.00000 57.77778
## 20       Nebraska   2021 12.0 29.00000 14.00000  4.000000  8.00000 46.15385
## 21   Northwestern   2021 12.0 26.50000 14.00000  8.000000 14.50000 41.78040
## 22     Ohio State   2021 11.0 17.00000 13.00000  3.000000 15.00000 42.00000
## 23     Penn State   2021  4.0 22.00000 15.00000  5.000000 13.00000 57.14286
## 24         Purdue   2021  6.5 20.50000 13.50000  7.000000 10.50000 41.16059
## 25        Rutgers   2021 10.0 23.66667 14.33333  6.333333 11.33333 38.80757
## 26      Wisconsin   2021  2.0 30.00000 19.00000  2.000000 13.00000 48.38710
##    ThreePTP
## 1  50.00000
## 2  52.38095
## 3  34.79532
## 4  25.00000
## 5  33.33333
## 6  24.28571
## 7  25.38462
## 8  21.42857
## 9  54.16667
## 10 26.94805
## 11 27.08333
## 12 10.52632
## 13 37.85714
## 14 44.44444
## 15 36.57289
## 16 18.18182
## 17 26.66667
## 18 15.52632
## 19 58.33333
## 20 37.50000
## 21 36.11111
## 22 33.33333
## 23 31.57895
## 24 19.20290
## 25 39.81481
## 26 38.09524

These values will be listed in an organized set of tables under the conclusion tab.

Mike Woodson’s Conference Opponent Averages

The same was then done for Mike Woodson, starting with the overall averages across both seasons.

MW_conf_avg <- aggregate(cbind(OREB, DREB, AST, STL, TO, FGP, ThreePTP) ~ opponent, data = IU_BS_confMW, FUN = mean, na.rm = TRUE)
MW_conf_avg

##          opponent      OREB     DREB      AST      STL        TO      FGP
## 1        Illinois  9.500000 27.50000 13.75000 5.000000 12.000000 46.97511
## 2            Iowa 10.000000 21.50000 17.75000 6.750000 14.250000 48.97580
## 3        Maryland  7.333333 28.00000 14.33333 4.333333 11.000000 46.80260
## 4        Michigan  7.500000 24.25000 14.00000 6.500000 10.250000 43.58259
## 5  Michigan State  8.666667 21.33333 12.33333 5.000000 11.000000 43.05701
## 6       Minnesota  7.666667 32.00000 16.66667 3.333333  8.666667 48.28042
## 7        Nebraska  8.000000 27.33333 14.00000 7.333333 14.666667 49.22807
## 8    Northwestern  7.333333 29.33333 14.00000 2.000000 12.000000 45.84628
## 9      Ohio State 13.666667 25.66667 14.33333 6.000000  9.666667 41.78620
## 10     Penn State 10.000000 22.33333 14.33333 5.000000  8.666667 44.90112
## 11         Purdue  5.750000 20.75000 11.25000 7.500000  6.500000 46.92774
## 12        Rutgers  9.000000 24.00000 11.00000 5.666667 12.333333 39.08730
## 13      Wisconsin 10.666667 29.33333 12.66667 3.000000  9.666667 42.15583
##    ThreePTP
## 1  32.96620
## 2  32.41228
## 3  31.91142
## 4  36.57895
## 5  39.84127
## 6  37.89683
## 7  37.04429
## 8  34.09091
## 9  31.63743
## 10 43.00797
## 11 35.41667
## 12 31.41270
## 13 26.24644

Then, the aggregation by season was done to find season averages.

MW_conf_avg_by_season <- aggregate(cbind(OREB, DREB, AST, STL, TO, FGP, ThreePTP) ~ opponent + season, data = IU_BS_confMW, FUN = mean, na.rm = TRUE)
MW_conf_avg_by_season

##          opponent season OREB DREB  AST STL   TO      FGP ThreePTP
## 1        Illinois   2022  7.0 27.0 13.5 4.0  8.5 41.07143 26.53846
## 2            Iowa   2022 12.5 21.0 19.5 6.5 16.5 49.28122 29.06699
## 3        Maryland   2022  5.0 28.0 16.0 5.5 10.5 51.45390 34.23077
## 4        Michigan   2022  8.0 21.0 16.5 6.5 10.0 42.88642 39.82456
## 5  Michigan State   2022 12.0 22.0 14.0 6.0 11.0 33.89831 23.80952
## 6       Minnesota   2022  6.5 30.5 16.5 4.0  8.0 51.88492 42.55952
## 7        Nebraska   2022  8.5 27.0 10.0 7.0 14.5 47.17544 33.56643
## 8    Northwestern   2022  6.0 28.0 13.0 3.0  7.0 37.03704 25.00000
## 9      Ohio State   2022 13.0 26.5 13.0 6.5 10.0 37.67930 22.45614
## 10     Penn State   2022  8.5 21.0 14.0 7.0  7.5 45.31778 50.22624
## 11         Purdue   2022  6.0 26.0 11.5 7.5  6.0 43.09524 27.50000
## 12        Rutgers   2022  9.0 22.0  9.0 7.0  9.0 41.07143 28.57143
## 13      Wisconsin   2022 11.5 27.5 14.0 2.5 10.5 39.84664 33.11966
## 14       Illinois   2023 12.0 28.0 14.0 6.0 15.5 52.87879 39.39394
## 15           Iowa   2023  7.5 22.0 16.0 7.0 12.0 48.67037 35.75758
## 16       Maryland   2023 12.0 28.0 11.0 2.0 12.0 37.50000 27.27273
## 17       Michigan   2023  7.0 27.5 11.5 6.5 10.5 44.27876 33.33333
## 18 Michigan State   2023  7.0 21.0 11.5 4.5 11.0 47.63636 47.85714
## 19      Minnesota   2023 10.0 35.0 17.0 2.0 10.0 41.07143 28.57143
## 20       Nebraska   2023  7.0 28.0 22.0 8.0 15.0 53.33333 44.00000
## 21   Northwestern   2023  8.0 30.0 14.5 1.5 14.5 50.25090 38.63636
## 22     Ohio State   2023 15.0 24.0 17.0 5.0  9.0 50.00000 50.00000
## 23     Penn State   2023 13.0 25.0 15.0 1.0 11.0 44.06780 28.57143
## 24         Purdue   2023  5.5 15.5 11.0 7.5  7.0 50.76023 43.33333
## 25        Rutgers   2023  9.0 25.0 12.0 5.0 14.0 38.09524 32.83333
## 26      Wisconsin   2023  9.0 33.0 10.0 4.0  8.0 46.77419 12.50000

These values will be listed in an organized set of tables under the conclusion tab.

Conference Opponent Average Conclusions

Overall, the analyst found that against 9 of the 13 conference opponents, the Hoosiers posted better averages in a majority of the game components (OREB, DREB, AST, STL, TO, FGP, and ThreePTP) during Woodson’s first two years than during Archie Miller’s last two years. For example, the Hoosiers under Woodson outperformed the Hoosiers under Miller in average field goal percentage against 11 of the 13 conference opponents. Evidence like this suggests that Woodson has the program moving in the right direction.
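The field goal percentage count can be reproduced directly from the per-opponent averages printed earlier. The sketch below copies the FGP columns of `AM_conf_avg` and `MW_conf_avg` into two small data frames (named `AM` and `MW` here purely for illustration), merges them by opponent, and counts the opponents against which Woodson’s average is higher.

```r
opponents <- c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State",
               "Minnesota", "Nebraska", "Northwestern", "Ohio State",
               "Penn State", "Purdue", "Rutgers", "Wisconsin")

# FGP by opponent, copied from the AM_conf_avg and MW_conf_avg output above.
AM <- data.frame(opponent = opponents,
                 FGP = c(40.35796, 43.63191, 41.99510, 42.18159, 40.73365,
                         51.02323, 48.01078, 40.19928, 42.68481, 44.35626,
                         37.72054, 37.02235, 41.62329))
MW <- data.frame(opponent = opponents,
                 FGP = c(46.97511, 48.97580, 46.80260, 43.58259, 43.05701,
                         48.28042, 49.22807, 45.84628, 41.78620, 44.90112,
                         46.92774, 39.08730, 42.15583))

# One row per opponent with both coaches' averages side by side.
both <- merge(AM, MW, by = "opponent", suffixes = c("_AM", "_MW"))
both$woodson_better <- both$FGP_MW > both$FGP_AM

sum(both$woodson_better)              # opponents where Woodson's FGP is higher
both$opponent[!both$woodson_better]   # the exceptions
```

The same merge-and-compare pattern extends to the other components; for turnovers the comparison flips, since a lower average is better.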

The analyst also compared the shooting percentages between Woodson’s first and second years as head coach and found that field goal percentage and three-point percentage each improved against 8 of the 13 conference opponents. Not only is Woodson improving on Miller’s averages, but he also appears to have improved on his own.
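The year-over-year comparison can be sketched the same way. The example below copies Woodson’s field goal percentage by opponent and season from the `MW_conf_avg_by_season` output above into a long data frame (named `fgp` here for illustration), reshapes it to one row per opponent, and counts the opponents against which the percentage rose from 2022 to 2023.

```r
opponents <- c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State",
               "Minnesota", "Nebraska", "Northwestern", "Ohio State",
               "Penn State", "Purdue", "Rutgers", "Wisconsin")

# Long format: one row per opponent and season, values copied from the output above.
fgp <- data.frame(
  opponent = rep(opponents, times = 2),
  season   = rep(c(2022, 2023), each = 13),
  FGP = c(41.07143, 49.28122, 51.45390, 42.88642, 33.89831, 51.88492, 47.17544,
          37.03704, 37.67930, 45.31778, 43.09524, 41.07143, 39.84664,
          52.87879, 48.67037, 37.50000, 44.27876, 47.63636, 41.07143, 53.33333,
          50.25090, 50.00000, 44.06780, 50.76023, 38.09524, 46.77419)
)

# Wide format: columns FGP.2022 and FGP.2023, one row per opponent.
wide <- reshape(fgp, idvar = "opponent", timevar = "season", direction = "wide")
wide$change <- wide$FGP.2023 - wide$FGP.2022

sum(wide$change > 0)   # opponents against which FGP improved in year two
```

Swapping the `FGP` values for the `ThreePTP` column gives the matching count for three-point percentage.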

For more details on the relationships between the team’s performance under a given coach and a specific opponent, navigate through the tabs below:

Illinois

The table for average difference between each coach and their respective seasons is pasted below:

To start, the analyst looked at the difference in the overall averages between coaches.

Overall aspects of the game against Illinois that were better for Archie Miller than for Mike Woodson include:

Steals (STL)
Turnovers (TO)
Three Point Percentage (ThreePTP)

Overall aspects of the game against Illinois that are better for Mike Woodson than for Archie Miller include:

Offensive Rebounds (OREB)
Defensive Rebounds (DREB)
Assists (AST)
Field Goal Percentage (FGP)

The analyst then compared Coach Woodson’s first season against Illinois to his second. Under Coach Woodson’s guidance, the Hoosiers improved in nearly every aspect of the game against Illinois; the only exception was turnovers, which rose from 8.5 to 15.5. Most impressive is the increase in both shooting averages by more than 10 percentage points (field goal percentage up 11.8 and three-point percentage up 12.9). It appears that Woodson has a strong game plan against Illinois.

Iowa

The table for average difference between each coach and their respective seasons is pasted below:

To start, the analyst looked at the difference in the overall averages between coaches.

Overall aspects of the game against Iowa that were better for Archie Miller than for Mike Woodson include:

Offensive Rebounds (OREB)
Defensive Rebounds (DREB)
Steals (STL)
Turnovers (TO)
Three Point Percentage (ThreePTP)

Overall aspects of the game against Iowa that are better for Mike Woodson than for Archie Miller include:

Assists (AST)
Field Goal Percentage (FGP)

The analyst then compared Coach Woodson’s first season against Iowa to his second. Under Coach Woodson’s guidance, the Hoosiers improved in some aspects of the game against Iowa: defensive rebounds, steals, turnovers, and three-point percentage. While field goal percentage did not increase, it stayed about the same under Woodson (49.3% to 48.7%).

Maryland

The table for average difference between each coach and their respective seasons is pasted below:

To start, the analyst looked at the difference in the overall averages between coaches.

Overall aspects of the game against Maryland that were better for Archie Miller than for Mike Woodson include:

Offensive Rebounds (OREB)
Turnovers (TO)

Overall aspects of the game against Maryland that are better for Mike Woodson than for Archie Miller include:

Defensive Rebounds (DREB)
Assists (AST)
Steals (STL)
Field Goal Percentage (FGP)
Three Point Percentage (ThreePTP)

The analyst then compared Coach Woodson’s first season against Maryland to his second. From this, it is clear that Coach Woodson has struggled against this opponent in some areas of play. In particular, large decreases can be found in the shooting percentages (field goal percentage down about 14 percentage points and three-point percentage down about 7).

Michigan

The table for average difference between each coach and their respective seasons is pasted below:

To start, the analyst looked at the difference in the overall averages between coaches.

Overall aspects of the game against Michigan that were better for Archie Miller than for Mike Woodson include:

Turnovers (TO)

Overall aspects of the game against Michigan that are better for Mike Woodson than for Archie Miller include:

Offensive Rebounds (OREB)
Defensive Rebounds (DREB)
Assists (AST)
Steals (STL)
Field Goal Percentage (FGP)
Three Point Percentage (ThreePTP)

The analyst then compared Coach Woodson’s first season against Michigan to his second. The analyst found that while some values increased and others decreased, most stayed at around the same level. For example, the Hoosiers averaged 8.0 offensive rebounds in his first year and 7.0 in his second. Similarly, the field goal percentage was 42.9% in his first year and 44.3% in his second. One area of notable improvement is defensive rebounds, which moved from 21.0 to 27.5. Overall, Woodson’s teams are developing against Michigan.

Michigan State

The table for average difference between each coach and their respective seasons is pasted below:

To start, the analyst looked at the difference in the overall averages between coaches.

Overall aspects of the game against Michigan State that were better for Archie Miller than for Mike Woodson include:

Offensive Rebounds (OREB)
Defensive Rebounds (DREB)
Turnovers (TO)
Steals (STL)

Overall aspects of the game against Michigan State that are better for Mike Woodson than for Archie Miller include:

Assists (AST)
Field Goal Percentage (FGP)
Three Point Percentage (ThreePTP)

The analyst then compared Coach Woodson’s first season against Michigan State to his second. The analyst found that while many values decreased, the team’s shooting percentages actually increased over his two years. These increases were not small: field goal percentage rose by 13.7 percentage points and three-point percentage by 24.0. Similar to Maryland, though, it appears that in most other aspects of the game Woodson struggled against Michigan State.

Minnesota

The table for average difference between each coach and their respective seasons is pasted below:

To start, the analyst looked at the difference in the overall averages between coaches.

Overall aspects of the game against Minnesota that were better for Archie Miller than for Mike Woodson include:

Offensive Rebounds (OREB)
Steals (STL)
Field Goal Percentage (FGP)

Overall aspects of the game against Minnesota that are better for Mike Woodson than for Archie Miller include:

Defensive Rebounds (DREB)
Assists (AST)
Turnovers (TO)
Three Point Percentage (ThreePTP)

The analyst then compared Coach Woodson’s first season against Minnesota to his second. The analyst found a general increase in all variables except the shooting percentages, turnovers, and steals. While the differences in steals and turnovers were small, both shooting percentages fell by more than 10 percentage points (field goal percentage down 10.8 and three-point percentage down 14.0). Even so, Woodson has still amassed a 3-0 record against Minnesota, so this is not as concerning.

Nebraska

The table for average difference between each coach and their respective seasons is pasted below:

To start, the analyst looked at the difference in the overall averages between coaches.

Overall aspects of the game against Nebraska that were better for Archie Miller than for Mike Woodson include:

Offensive Rebounds (OREB)
Defensive Rebounds (DREB)
Assists (AST)
Turnovers (TO)

Overall aspects of the game against Nebraska that are better for Mike Woodson than for Archie Miller include:

Steals (STL)
Field Goal Percentage (FGP)
Three Point Percentage (ThreePTP)

The analyst then compared Coach Woodson’s first season against Nebraska to his second. The analyst found that most components of the game improved from his first season to his second. The largest changes came from assists (up 12) and three-point percentage (up about 10 percentage points). Both coaches have a 3-0 record against this team. Thus, any values where Miller posted better numbers are less of a concern, as Woodson’s teams have still been effective against Nebraska.

Northwestern

The table for average difference between each coach and their respective seasons is pasted below:

To start, the analyst looked at the difference in the overall averages between coaches.

Overall aspects of the game against Northwestern that were better for Archie Miller than for Mike Woodson include:

Offensive Rebounds (OREB)
Steals (STL)

Overall aspects of the game against Northwestern that are better for Mike Woodson than for Archie Miller include:

Defensive Rebounds (DREB)
Assists (AST)
Turnovers (TO)
Field Goal Percentage (FGP)
Three Point Percentage (ThreePTP)

The analyst then compared Coach Woodson’s first season against Northwestern to his second. The analyst found that most components of the game improved from his first season as coach to his second. The largest changes came from average field goal percentage (up 13.2 percentage points) and three-point percentage (up 13.6). Given this, Woodson is improving the level of play against Northwestern.

Ohio State

The table for average difference between each coach and their respective seasons is pasted below:

To start, the analyst looked at the difference in the overall averages between coaches.

Overall aspects of the game against Ohio State that were better for Archie Miller than for Mike Woodson include:

Field Goal Percentage (FGP)
Three Point Percentage (ThreePTP)

Overall aspects of the game against Ohio State that are better for Mike Woodson than for Archie Miller include:

Offensive Rebounds (OREB)
Defensive Rebounds (DREB)
Assists (AST)
Turnovers (TO)

The analyst then compared Coach Woodson’s first season against Ohio State to his second. The analyst found that most components of the game improved from his first season as coach to his second. The largest changes came from average field goal percentage (up 12.3 percentage points) and three-point percentage (up 27.5). While his first year against Ohio State may have been rough, Woodson appears to have found an approach in his second season that helps his team be more successful.

Penn State

The table for average difference between each coach and their respective seasons is pasted below:

To start, the analyst looked at the difference in the overall averages between coaches.

Overall aspects of the game against Penn State that were better for Archie Miller than for Mike Woodson include:

Defensive Rebounds (DREB)
Steals (STL)

Overall aspects of the game against Penn State that are better for Mike Woodson than for Archie Miller include:

Offensive Rebounds (OREB)
Assists (AST)
Turnovers (TO)
Field Goal Percentage (FGP)
Three Point Percentage (ThreePTP)

The analyst then looked at Coach Woodson’s averages and compared his first season against Penn State to his second. Some components of the game improved, while others fell short of the season before. The largest changes were an improvement in average offensive rebounds (up 4) and a decrease in three-point percentage of nearly 21.6%. However, even with the large decrease in three-point percentage, field goal percentage held at about the same value: 45.3% in the first year and 44.1% in the second. Thus, while there are some concerns about competing against Penn State, it appears that Woodson is still holding the team at a competitive level against them.

Purdue

The table for average difference between each coach and their respective seasons is pasted below:

To start, the analyst looked at the difference in the overall averages between coaches.

Overall aspects of the game against Purdue that were better for Archie Miller than for Mike Woodson include:

Offensive Rebounds (OREB)
Assists (AST)

Overall aspects of the game against Purdue that are better for Mike Woodson than for Archie Miller include:

Defensive Rebounds (DREB)
Steals (STL)
Turnovers (TO)
Field Goal Percentage (FGP)
Three Point Percentage (ThreePTP)

The analyst then looked at Coach Woodson’s averages and compared his first season against Purdue to his second. Some components of the game improved, while others fell short of the season before. The most notable differences occur in the shooting percentages: field goal percentage improved by about 7%, while three-point percentage saw a lofty improvement of 15.8%. With such large increases in shooting percentages, it does not come as a surprise that rebounds were down in the second year. It appears that Coach Woodson has improved the program against Purdue.

Rutgers

The table for average difference between each coach and their respective seasons is pasted below:

To start, the analyst looked at the difference in the overall averages between coaches.

Overall aspects of the game against Rutgers that were better for Archie Miller than for Mike Woodson include:

Offensive Rebounds (OREB)
Defensive Rebounds (DREB)
Assists (AST)
Steals (STL)
Three Point Percentage (ThreePTP)

Overall aspects of the game against Rutgers that are better for Mike Woodson than for Archie Miller include:

Turnovers (TO)
Field Goal Percentage (FGP)

The analyst then looked at Coach Woodson’s averages and compared his first season against Rutgers to his second. The analyst found that some components of the game improved, while others fell short of the season before. A notable difference is that both shooting percentages decreased in his second year. However, neither field goal percentage nor three point percentage saw decreases more than 5%. Other areas also saw some improvement like defensive rebounds and assists. All in all, there are some concerns about Coach Woodson’s performance against Rutgers.

Wisconsin

The table for average difference between each coach and their respective seasons is pasted below:

To start, the analyst looked at the difference in the overall averages between coaches.

Overall aspects of the game against Wisconsin that were better for Archie Miller than for Mike Woodson include:

Assists (AST)
Steals (STL)
Three Point Percentage (ThreePTP)

Overall aspects of the game against Wisconsin that are better for Mike Woodson than for Archie Miller include:

Offensive Rebounds (OREB)
Defensive Rebounds (DREB)
Turnovers (TO)
Field Goal Percentage (FGP)

The analyst then looked at Coach Woodson’s averages and compared his first season against Wisconsin to his second. The analyst found that some components of the game improved, while others fell short of the season before. One area worth noting is both shooting percentages. Field goal percentages increased by nearly 7% between the first and second season. However, three point percentage fell more than 20% between the two years. There were other areas that saw good improvement like defensive rebounds and turnovers. With all of this being taken into consideration, there are still some concerns about Coach Woodson’s performance against Wisconsin.

Clustering Average Team Performances Against Conference Opponents

The analyst then looked to cluster the average team performance by opponent within the conference. This was done for each coach so that the cluster results between the two coaches could be compared. As demonstrated below, the analyst used k-means clustering. First, the process was done for Archie Miller’s two seasons as head coach. The data used to generate the clusters is printed below so that teams can be identified within each cluster.

AM_conf_avg

##          opponent      OREB     DREB      AST      STL       TO      FGP
## 1        Illinois  8.666667 26.33333 11.00000 5.333333 11.00000 40.35796
## 2            Iowa 13.333333 25.00000 15.33333 8.333333 11.33333 43.63191
## 3        Maryland 12.000000 27.33333 13.00000 2.333333 10.00000 41.99510
## 4        Michigan  7.000000 17.00000  7.50000 2.500000  9.00000 42.18159
## 5  Michigan State 10.000000 22.33333 11.66667 7.333333  9.00000 40.73365
## 6       Minnesota  9.000000 27.00000 14.66667 4.333333 12.33333 51.02323
## 7        Nebraska 14.333333 33.33333 16.33333 3.666667 13.00000 48.01078
## 8    Northwestern 13.000000 26.00000 13.00000 8.333333 15.00000 40.19928
## 9      Ohio State  7.666667 21.33333 11.00000 6.000000 13.00000 42.68481
## 10     Penn State  8.000000 27.00000 10.33333 6.000000 14.00000 44.35626
## 11         Purdue  9.250000 20.50000 12.00000 5.750000 12.00000 37.72054
## 12        Rutgers 10.250000 24.25000 12.25000 5.750000 12.50000 37.02235
## 13      Wisconsin  8.000000 24.66667 13.33333 3.666667 10.66667 41.62329
##    ThreePTP
## 1  46.29630
## 2  41.84224
## 3  29.25749
## 4  25.83333
## 5  21.46199
## 6  35.63492
## 7  29.42308
## 8  31.21693
## 9  47.22222
## 10 28.49168
## 11 23.14312
## 12 32.49269
## 13 37.93651

Now, the k-means clustering model can be made using only the numeric columns of data. The analyst decided that three clusters were appropriate, in hopes of having a high performance cluster, a mid performance cluster, and a low performance cluster. The code to do this can be seen below:

set.seed(1234)

#fit the model
fit <- kmeans(AM_conf_avg[, 2:8], 3)#3 is the number of clusters we want to build
fit

## K-means clustering with 3 clusters of sizes 6, 3, 4
## 
## Cluster means:
##        OREB     DREB      AST      STL       TO      FGP ThreePTP
## 1 11.097222 27.48611 13.26389 5.069444 12.80556 43.76783 31.08613
## 2  8.750000 19.94444 10.38889 5.194444 10.00000 40.21193 23.47948
## 3  9.416667 24.33333 12.66667 5.833333 11.50000 42.07449 43.32432
## 
## Clustering vector:
##  [1] 3 3 1 2 2 1 1 1 3 1 2 1 3
## 
## Within cluster sum of squares by cluster:
## [1] 302.47030  70.36058 123.28281
##  (between_SS / total_SS =  64.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
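Because k-means groups rows by Euclidean distance, columns with a wider spread (here ThreePTP, which runs from about 21 to 47) can outweigh columns with a narrow spread such as STL. One variation worth considering, which is not what the analyst ran, is to standardise the columns first and use several random starts. The sketch below rebuilds `AM_conf_avg` from the table printed above, rounded to two decimals; `scaled` and `fit_scaled` are illustrative names, and `nstart = 25` is an assumed setting.

```r
# AM_conf_avg re-entered from the printed table above (rounded to 2 decimals)
AM_conf_avg <- data.frame(
  opponent = c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State",
               "Minnesota", "Nebraska", "Northwestern", "Ohio State",
               "Penn State", "Purdue", "Rutgers", "Wisconsin"),
  OREB = c(8.67, 13.33, 12.00, 7.00, 10.00, 9.00, 14.33, 13.00, 7.67, 8.00, 9.25, 10.25, 8.00),
  DREB = c(26.33, 25.00, 27.33, 17.00, 22.33, 27.00, 33.33, 26.00, 21.33, 27.00, 20.50, 24.25, 24.67),
  AST = c(11.00, 15.33, 13.00, 7.50, 11.67, 14.67, 16.33, 13.00, 11.00, 10.33, 12.00, 12.25, 13.33),
  STL = c(5.33, 8.33, 2.33, 2.50, 7.33, 4.33, 3.67, 8.33, 6.00, 6.00, 5.75, 5.75, 3.67),
  TO = c(11.00, 11.33, 10.00, 9.00, 9.00, 12.33, 13.00, 15.00, 13.00, 14.00, 12.00, 12.50, 10.67),
  FGP = c(40.36, 43.63, 42.00, 42.18, 40.73, 51.02, 48.01, 40.20, 42.68, 44.36, 37.72, 37.02, 41.62),
  ThreePTP = c(46.30, 41.84, 29.26, 25.83, 21.46, 35.63, 29.42, 31.22, 47.22, 28.49, 23.14, 32.49, 37.94)
)

# Standardise each numeric column to mean 0 and standard deviation 1
scaled <- scale(AM_conf_avg[, 2:8])

set.seed(1234)
# nstart = 25 (assumed) tries 25 random starts and keeps the best solution
fit_scaled <- kmeans(scaled, centers = 3, nstart = 25)

# List the opponents that fall in each cluster
split(AM_conf_avg$opponent, fit_scaled$cluster)
```

The cluster memberships from this variation may differ from the ones reported in this section, since every attribute now carries equal weight.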

Once the clusters were made, it was time to analyze the results of the k-means model. The clusters appear to follow the general pattern the analyst was hoping for. Below are the clusters found from Indiana’s team performance averages during Archie Miller’s last two years as head coach:

Cluster 1- Maryland, Minnesota, Nebraska, Northwestern, Penn State, and Rutgers
- The teams that the Hoosiers had the best average performance against under Archie Miller. This cluster posts the highest averages in offensive rebounds, defensive rebounds, assists, and field goal percentage. Its three-point percentage (31.1%) falls between the other two clusters, and it has the fewest steals and the most turnovers. Notice that there are 6 teams in this category.
Cluster 2- Michigan, Michigan State, and Purdue
- The teams that the Hoosiers had the worst average performance against under Archie Miller. This cluster posts the lowest averages in every attribute except steals, where it ranks in the middle. Its low turnover average (10.0) is the one favorable mark. The average performance of the team was simply much lower when playing these teams. Notice that there are only 3 teams in this category.
Cluster 3- Illinois, Iowa, Ohio State, and Wisconsin
- The teams that the Hoosiers had a mid-level performance against under Archie Miller. This cluster posts the middle averages in rebounds, assists, turnovers, and field goal percentage for Miller’s last two seasons, along with the highest steals and three-point percentage (43.3%) of the three clusters. These are the teams against which Miller’s teams neither seriously struggled nor competed particularly well. Notice that this cluster includes 4 teams.
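The cluster numbers that `kmeans` returns are arbitrary, so the high, mid, and low labels have to be assigned after the fact. A minimal sketch of one way to do that is shown below, ranking the clusters by their FGP center alone. That single-attribute ranking is an assumption made for illustration, since the analyst judged the clusters across several attributes. The vectors are copied from the output printed above, and `tier_names`, `tier_of_cluster`, and `tier` are hypothetical names.

```r
opponent <- c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State",
              "Minnesota", "Nebraska", "Northwestern", "Ohio State",
              "Penn State", "Purdue", "Rutgers", "Wisconsin")

# Clustering vector and FGP cluster means copied from the kmeans output above
cluster <- c(3, 3, 1, 2, 2, 1, 1, 1, 3, 1, 2, 1, 3)
fgp_center <- c(43.76783, 40.21193, 42.07449)

# rank() gives 1 to the smallest FGP center and 3 to the largest
tier_names <- c("low", "mid", "high")
tier_of_cluster <- tier_names[rank(fgp_center)]

# Translate each opponent's cluster number into its tier label
tier <- tier_of_cluster[cluster]
split(opponent, factor(tier, levels = c("high", "mid", "low")))
```

For this output the FGP ranking reproduces the labels used in the descriptions above: cluster 1 is high, cluster 3 is mid, and cluster 2 is low.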

Now, the same was done for Mike Woodson’s first two years as head coach. The dataset being used is printed below so that it is available for reference.

MW_conf_avg

##          opponent      OREB     DREB      AST      STL        TO      FGP
## 1        Illinois  9.500000 27.50000 13.75000 5.000000 12.000000 46.97511
## 2            Iowa 10.000000 21.50000 17.75000 6.750000 14.250000 48.97580
## 3        Maryland  7.333333 28.00000 14.33333 4.333333 11.000000 46.80260
## 4        Michigan  7.500000 24.25000 14.00000 6.500000 10.250000 43.58259
## 5  Michigan State  8.666667 21.33333 12.33333 5.000000 11.000000 43.05701
## 6       Minnesota  7.666667 32.00000 16.66667 3.333333  8.666667 48.28042
## 7        Nebraska  8.000000 27.33333 14.00000 7.333333 14.666667 49.22807
## 8    Northwestern  7.333333 29.33333 14.00000 2.000000 12.000000 45.84628
## 9      Ohio State 13.666667 25.66667 14.33333 6.000000  9.666667 41.78620
## 10     Penn State 10.000000 22.33333 14.33333 5.000000  8.666667 44.90112
## 11         Purdue  5.750000 20.75000 11.25000 7.500000  6.500000 46.92774
## 12        Rutgers  9.000000 24.00000 11.00000 5.666667 12.333333 39.08730
## 13      Wisconsin 10.666667 29.33333 12.66667 3.000000  9.666667 42.15583
##    ThreePTP
## 1  32.96620
## 2  32.41228
## 3  31.91142
## 4  36.57895
## 5  39.84127
## 6  37.89683
## 7  37.04429
## 8  34.09091
## 9  31.63743
## 10 43.00797
## 11 35.41667
## 12 31.41270
## 13 26.24644

Now, the analyst must generate the model, adapting the previous code for the new data set. Understand that the analyst still aims to create 3 clusters for this model.

set.seed(1234)

#fit the model
fit <- kmeans(MW_conf_avg[, 2:8], 3)#3 is the number of clusters we want to build
fit

## K-means clustering with 3 clusters of sizes 3, 4, 6
## 
## Cluster means:
##        OREB     DREB      AST      STL        TO      FGP ThreePTP
## 1 11.111111 26.33333 12.66667 4.888889 10.555556 41.00978 29.76552
## 2  7.979167 22.16667 12.97917 6.000000  9.104167 44.61711 38.71121
## 3  8.305556 27.61111 15.08333 4.791667 12.097222 47.68471 34.38699
## 
## Clustering vector:
##  [1] 3 3 3 2 2 3 3 3 1 2 2 1 1
## 
## Within cluster sum of squares by cluster:
## [1]  65.99112  83.55141 166.08657
##  (between_SS / total_SS =  53.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Cluster 1- Ohio State, Rutgers, and Wisconsin
- The teams that the Hoosiers had the worst average performance against under Mike Woodson. This cluster posts the lowest averages in both shooting percentages, as well as assists. These are the teams that Woodson has posted the worst results against within his first two seasons. Notice that there are only 3 teams in this category.
Cluster 2- Michigan, Michigan State, Penn State, and Purdue
- The teams that the Hoosiers had a mid-level performance against under Mike Woodson. This cluster posts the middle averages in field goal percentage and assists, although it has the highest three-point percentage (38.7%) of the three clusters. These are the teams that Woodson appeared to neither seriously struggle against nor compete particularly well against. Notice that this cluster has 4 teams.
Cluster 3- Illinois, Iowa, Maryland, Minnesota, Nebraska, and Northwestern
- The teams that the Hoosiers had the best average performance against under Mike Woodson. This cluster posts the best averages in field goal percentage, defensive rebounds, and assists, while its three-point percentage (34.4%) falls in the middle. These are the teams that Woodson posted the best results against. Notice that there are once again 6 teams in this category.

After observing each coach’s individual trends, it was time to compare them against each other. First, it is important to notice that the field goal percentages in Woodson’s clusters are generally higher than those in Miller’s clusters. The maximum cluster FGP for Miller was 43.8%, while Woodson’s was 47.7%. However, Miller had the highest three-point cluster average at 43.3%, compared to Woodson’s highest at 38.7%. Miller also posts the lowest three-point cluster average at 23.5%, in comparison to Woodson’s 29.8%. Thus, the range of Woodson’s three-point averages appears narrower than that of Miller’s. This provides some evidence that, on average, Woodson’s teams play consistently better than Miller’s teams.
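The spread of the cluster averages can be checked directly from the printed cluster means. The sketch below copies the FGP and ThreePTP centers from the two `kmeans` outputs above; `spread` is a hypothetical helper that returns the gap between the largest and smallest center.

```r
# Cluster means copied from the two kmeans outputs above
am_centers <- data.frame(FGP = c(43.76783, 40.21193, 42.07449),
                         ThreePTP = c(31.08613, 23.47948, 43.32432))
mw_centers <- data.frame(FGP = c(41.00978, 44.61711, 47.68471),
                         ThreePTP = c(29.76552, 38.71121, 34.38699))

# Gap between the largest and smallest cluster center
spread <- function(x) diff(range(x))

spreads <- rbind(Miller = sapply(am_centers, spread),
                 Woodson = sapply(mw_centers, spread))
round(spreads, 1)
```

On these numbers the three-point spread is about 19.8 points for Miller and about 8.9 for Woodson, while the field goal spread is about 3.6 for Miller and about 6.7 for Woodson, so the narrower range applies to three-point shooting specifically.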

After looking at the difference in averages for the clusters, the teams were then compared within each cluster to see any similarities in opponents between the two coaches.

Both coaches had Maryland, Minnesota, Nebraska, and Northwestern in their high performance clusters.
The coaches had no opponents in common in their mid-level performance clusters.
The coaches had no opponents in common in their low performance clusters.
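The overlap between the two coaches can be reproduced with `intersect()`. This is a minimal sketch that assumes the tier assignments described above (for Miller, cluster 1 is high, 3 is mid, and 2 is low; for Woodson, cluster 3 is high, 2 is mid, and 1 is low). The clustering vectors are copied from the printed output, and `am_tier`, `mw_tier`, and `shared` are hypothetical names.

```r
opponent <- c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State",
              "Minnesota", "Nebraska", "Northwestern", "Ohio State",
              "Penn State", "Purdue", "Rutgers", "Wisconsin")

# Clustering vectors copied from the two kmeans outputs above
am_cluster <- c(3, 3, 1, 2, 2, 1, 1, 1, 3, 1, 2, 1, 3)
mw_cluster <- c(3, 3, 3, 2, 2, 3, 3, 3, 1, 2, 2, 1, 1)

# Position i of each label vector is the tier assigned to cluster i
am_tier <- c("high", "low", "mid")[am_cluster]
mw_tier <- c("low", "mid", "high")[mw_cluster]

# Opponents that land in the same tier for both coaches
tiers <- c(high = "high", mid = "mid", low = "low")
shared <- lapply(tiers, function(t) {
  intersect(opponent[am_tier == t], opponent[mw_tier == t])
})
shared
```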

Field Goal Percentage Prediction for Next Season

The analyst then looked to build a decision tree model to predict the average field goal percentage for the Hoosiers under Coach Woodson for the 2023-24 season. This model will be built using all of the data for the past two seasons of play and the prediction will come from the average values of the most recent season.

First, start by getting the data ready for the decision tree:

IU_BS_confMW <- IU_BS_confMW[,-(10:13)]
IU_BS_confMW <- IU_BS_confMW[,-4]
IU_BS_confMW <- IU_BS_confMW[,-1]

Now, build the decision tree model:

rpart2024 <- rpart(formula = FGP ~ ., data = IU_BS_confMW)
prp(rpart2024, digits =4, extra = 1)

Now, use the averages from the past season to forecast where the field goal percentage for Mike Woodson will be for his next season.

summary(IU_BS_confMW)

##       OREB             DREB            AST             STL        
##  Min.   : 2.000   Min.   :14.00   Min.   : 6.00   Min.   : 1.000  
##  1st Qu.: 7.000   1st Qu.:22.50   1st Qu.:11.00   1st Qu.: 4.000  
##  Median : 9.000   Median :25.00   Median :14.00   Median : 5.000  
##  Mean   : 8.791   Mean   :25.44   Mean   :13.91   Mean   : 5.302  
##  3rd Qu.:10.500   3rd Qu.:29.00   3rd Qu.:16.00   3rd Qu.: 7.000  
##  Max.   :15.000   Max.   :35.00   Max.   :22.00   Max.   :11.000  
##       BLK               TO              PF            FGP       
##  Min.   : 1.000   Min.   : 3.00   Min.   :10.0   Min.   :30.36  
##  1st Qu.: 3.000   1st Qu.: 9.00   1st Qu.:15.0   1st Qu.:40.98  
##  Median : 4.000   Median :10.00   Median :18.0   Median :45.16  
##  Mean   : 4.442   Mean   :10.81   Mean   :17.3   Mean   :45.33  
##  3rd Qu.: 6.000   3rd Qu.:13.00   3rd Qu.:20.0   3rd Qu.:49.44  
##  Max.   :10.000   Max.   :23.00   Max.   :25.0   Max.   :61.82  
##     ThreePTP          FTP        
##  Min.   :12.50   Min.   : 46.15  
##  1st Qu.:26.79   1st Qu.: 66.67  
##  Median :31.58   Median : 71.43  
##  Mean   :34.62   Mean   : 71.98  
##  3rd Qu.:40.83   3rd Qu.: 78.17  
##  Max.   :76.92   Max.   :100.00

Based on the decision tree and the current trends of Mike Woodson, the field goal percentage next season should be around the 47% mark for the Hoosiers. Understand that there are limitations to this prediction, as the model was not evaluated for accuracy (since there are only 40 rows of data). However, it is still a useful and realistic benchmark for Woodson to aim for within the next season.
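Instead of reading the forecast off the plotted tree, the averages can be passed to `predict()`. The sketch below only illustrates the mechanics. The data frame `toy` is simulated placeholder data with the same column layout as `IU_BS_confMW`, not the report’s box scores, and treating the last 20 rows as the most recent season is an assumption made for illustration.

```r
library(rpart)

set.seed(1234)
n <- 40

# Simulated placeholder data: NOT the report's box scores
toy <- data.frame(
  OREB = rpois(n, 9), DREB = rpois(n, 25), AST = rpois(n, 14),
  STL = rpois(n, 5), BLK = rpois(n, 4), TO = rpois(n, 11),
  PF = rpois(n, 17), ThreePTP = runif(n, 12, 77), FTP = runif(n, 46, 100)
)
toy$FGP <- 30 + 0.6 * toy$AST + 0.2 * toy$ThreePTP + rnorm(n, sd = 3)

# Fit the regression tree the same way as rpart2024 above
tree <- rpart(FGP ~ ., data = toy)

# Assumed: rows 21 to 40 stand in for the most recent season
recent <- toy[21:n, ]
predictors <- setdiff(names(toy), "FGP")

# One-row data frame holding the average of each predictor
recent_avg <- as.data.frame(t(colMeans(recent[, predictors])))

# The prediction is the mean FGP of the leaf that the averages fall into
predicted_fgp <- predict(tree, newdata = recent_avg)
predicted_fgp
```

With the real data, `toy` would be replaced by `IU_BS_confMW` and `recent` by the rows from the latest season, selected before the season column is dropped.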

Player Development

After gathering general team information, the analyst set out to compare individual player development between the two different coaches within the two year time span of data at hand. The analyst decided that the following areas would be looked into further to see if player development was occurring for each of the coaches:

Field Goal Percentage (FGP)
Three Point Percentage (ThreePTP)
Points Scored (PTS)
Assists (AST)
Rebounds (REB)

Manipulating the Data for Analysis

Filter the data to only display individual information.

player_BS <- filter(IU_BoxScores, player != "TEAM")

After filtering to remove rows consisting of team information, the data was then filtered to only include conference opponent games. Since there was consistency in these opponents across both coaches, this acts as a control group.

conf_opponents <- c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State", "Minnesota", "Nebraska", "Northwestern", "Ohio State", "Penn State", "Purdue", "Rutgers", "Wisconsin")
player_BS_conf <- filter(player_BS, opponent %in% conf_opponents)

Players Averaging Over 15 Minutes Per Game

After the data was filtered to only include conference games, the analyst decided to only look at players who averaged 15 minutes of play or more during each coach’s two-year period. This narrows the focus group and removes outliers from the data. This was done by grouping the data by player and finding the average number of minutes played.

#creating a dataset for conference play for Archie Miller
players_confAM <- filter(player_BS_conf, season == "2020" | season == "2021")

#grouping by players
playersAM <- group_by(players_confAM, player)

#find average minutes overall
players_confAM_MIN <- summarise(playersAM, type = mean(MIN, na.rm = TRUE))
arrange(players_confAM_MIN, desc(type))

## # A tibble: 17 × 2
##    player                type
##    <chr>                <dbl>
##  1 Trayce Jackson-Davis 32.5 
##  2 Justin Smith         31   
##  3 Al Durham            29.8 
##  4 Rob Phinisee         26.9 
##  5 Race Thompson        21.8 
##  6 Devonte Green        20.6 
##  7 Joey Brunk           19.0 
##  8 Trey Galloway        18.8 
##  9 Armaan Franklin      18.7 
## 10 Jerome Hunter        18.0 
## 11 Anthony Leal         11.3 
## 12 Damezi Anderson       9.88
## 13 De'Ron Davis          9.45
## 14 Khristian Lander      9.42
## 15 Jordan Geronimo       7.76
## 16 Cooper Bybee          0.5 
## 17 Nathan Childress      0.5

For conference games, the following players were found to average over the 15 minute mark:

Trayce Jackson-Davis
Justin Smith
Al Durham
Rob Phinisee
Race Thompson
Devonte Green
Joey Brunk
Trey Galloway
Armaan Franklin
Jerome Hunter

After discovering this, the data was then filtered one more time so that only these players would be included in the analysis.

playerAM_conf15 <- filter(players_confAM, player == "Trayce Jackson-Davis" | player == "Justin Smith" | player == "Al Durham" | player == "Rob Phinisee" | player == "Race Thompson" | player == "Devonte Green" | player == "Joey Brunk" | player == "Trey Galloway" | player == "Armaan Franklin" | player == "Jerome Hunter")
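The list of names does not have to be typed by hand, since it can be taken from the summary table. A minimal sketch follows, with `players_confAM_MIN` re-entered from the tibble printed above; `min_minutes` and `regulars` are hypothetical names.

```r
# Average minutes re-entered from the tibble printed above
players_confAM_MIN <- data.frame(
  player = c("Trayce Jackson-Davis", "Justin Smith", "Al Durham",
             "Rob Phinisee", "Race Thompson", "Devonte Green", "Joey Brunk",
             "Trey Galloway", "Armaan Franklin", "Jerome Hunter",
             "Anthony Leal", "Damezi Anderson", "De'Ron Davis",
             "Khristian Lander", "Jordan Geronimo", "Cooper Bybee",
             "Nathan Childress"),
  type = c(32.5, 31, 29.8, 26.9, 21.8, 20.6, 19.0, 18.8, 18.7, 18.0,
           11.3, 9.88, 9.45, 9.42, 7.76, 0.5, 0.5)
)

# Keep players averaging at least the cutoff
min_minutes <- 15
regulars <- players_confAM_MIN$player[players_confAM_MIN$type >= min_minutes]
regulars

# With the per-game data loaded, the filter would then be (not run here):
# playerAM_conf15 <- filter(players_confAM, player %in% regulars)
```

Deriving the names this way keeps the player list in step with the cutoff if the threshold is ever changed.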

Then, the same was done for the two seasons that Woodson was head coach.

#creating a dataset for conference play for Mike Woodson
players_confMW <- filter(player_BS_conf, season == "2022" | season == "2023")

#grouping by players
playersMW <- group_by(players_confMW, player)

#find average minutes overall
players_confMW_MIN <- summarise(playersMW, type = mean(MIN, na.rm = TRUE))
arrange(players_confMW_MIN, desc(type))

## # A tibble: 21 × 2
##    player                type
##    <chr>                <dbl>
##  1 Jalen Hood-Schifino   34.7
##  2 Trayce Jackson-Davis  34.6
##  3 Xavier Johnson        29.4
##  4 Miller Kopp           28.5
##  5 Trey Galloway         27.1
##  6 Race Thompson         26.9
##  7 Parker Stewart        24.6
##  8 Rob Phinisee          18.7
##  9 Tamar Bates           15.7
## 10 Malik Reneau          14.8
## # … with 11 more rows

For conference games, the following players were found to average over the 15 minute mark:

Jalen Hood-Schifino
Trayce Jackson-Davis
Xavier Johnson
Miller Kopp
Trey Galloway
Race Thompson
Parker Stewart
Rob Phinisee
Tamar Bates

After discovering this, the data was then filtered one more time so that only these players would be included in the analysis.

playerMW_conf15 <- filter(players_confMW, player == "Jalen Hood-Schifino" | player == "Trayce Jackson-Davis" | player == "Xavier Johnson" | player == "Miller Kopp" | player == "Trey Galloway" | player == "Race Thompson" | player == "Parker Stewart" | player == "Rob Phinisee" | player == "Tamar Bates")

FGP

Since some games may have 0 posted attempts for field goals, the analyst has to look at each shooting percentage by itself to ensure that no undefined values are derived. First, start by creating a data set with just season, player, position, FGM and FGA for Archie Miller.

columns <- c(1:3, 5:6)
playerAM_conf15_FG <- playerAM_conf15[,columns]
playerAM_conf15_FG <- filter(playerAM_conf15_FG, FGA != 0)

Now, create the new variable FGP with the following code:

playerAM_conf15_FG$FGP <- (playerAM_conf15_FG$FGM / playerAM_conf15_FG$FGA) * 100

Now, do the same for Mike Woodson. Starting by filtering the data accordingly.

columns <- c(1:3, 5:6)
playerMW_conf15_FG <- playerMW_conf15[,columns]
playerMW_conf15_FG <- filter(playerMW_conf15_FG, FGA != 0)

Then, generate the new variable:

playerMW_conf15_FG$FGP <- (playerMW_conf15_FG$FGM / playerMW_conf15_FG$FGA) * 100
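Note that FGP here is computed per game, so the averages that follow are means of game-level percentages, and a 1-for-2 night counts as much as a 9-for-10 night. The sketch below uses two made-up games for a hypothetical player to show how that differs from pooling makes and attempts; `games`, `mean_of_game_pcts`, and `pooled_pct` are illustrative names.

```r
# Two made-up games for one hypothetical player
games <- data.frame(FGM = c(1, 9), FGA = c(2, 10))
games$FGP <- games$FGM / games$FGA * 100

# Mean of per-game percentages: every game counts equally
mean_of_game_pcts <- mean(games$FGP)

# Pooled percentage: every attempt counts equally
pooled_pct <- sum(games$FGM) / sum(games$FGA) * 100

c(mean_of_game_pcts = mean_of_game_pcts, pooled_pct = pooled_pct)
```

In this toy case the mean of the game percentages is 70, while the pooled percentage is about 83.3. Either definition can be defended, but the tables in this section use the first.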

Now, aggregate the data in a couple different ways. First find the mean of each position by coach. The table below shows information for Coach Miller.

aggregate(cbind(FGP) ~ position, data = playerAM_conf15_FG, FUN = mean, na.rm = TRUE)

##   position      FGP
## 1        F 44.17147
## 2        G 33.33301

Then, the same was done for Coach Woodson.

aggregate(cbind(FGP) ~ position, data = playerMW_conf15_FG, FUN = mean, na.rm = TRUE)

##   position      FGP
## 1        F 49.93120
## 2        G 35.54146

As shown above, Woodson’s average field goal percentage is higher than Miller’s at both positions. The analyst then took this a step further by looking at position by season.

aggregate(cbind(FGP) ~ season + position, data = playerAM_conf15_FG, FUN = mean, na.rm = TRUE)

##   season position      FGP
## 1   2020        F 43.06275
## 2   2021        F 45.94923
## 3   2020        G 32.56220
## 4   2021        G 34.15736

Both positions for Miller saw an increase over the two-year span. Now, see if the same holds true for Woodson.

aggregate(cbind(FGP) ~ season + position, data = playerMW_conf15_FG, FUN = mean, na.rm = TRUE)

##   season position      FGP
## 1   2022        F 49.29014
## 2   2023        F 50.73846
## 3   2022        G 34.59260
## 4   2023        G 37.08536

Similarly, both positions for Woodson saw growth over his two years as coach. It is clear to see that with Woodson as coach the average FGP has been higher regardless of position and season. Next, the analyst wanted to look into individual player growth. First, look at Archie Miller’s player development.

aggregate(cbind(FGP) ~ season + player, data = playerAM_conf15_FG, FUN = mean, na.rm = TRUE)

##    season               player      FGP
## 1    2020            Al Durham 39.32738
## 2    2021            Al Durham 34.61451
## 3    2020      Armaan Franklin 26.06481
## 4    2021      Armaan Franklin 38.98416
## 5    2020        Devonte Green 29.66818
## 6    2020        Jerome Hunter 29.92997
## 7    2021        Jerome Hunter 43.59127
## 8    2020           Joey Brunk 45.62500
## 9    2020         Justin Smith 46.86273
## 10   2020        Race Thompson 39.47917
## 11   2021        Race Thompson 44.65602
## 12   2020         Rob Phinisee 34.64271
## 13   2021         Rob Phinisee 29.69719
## 14   2020 Trayce Jackson-Davis 50.73025
## 15   2021 Trayce Jackson-Davis 49.36461
## 16   2021        Trey Galloway 34.60784

Of the 10 players listed, four only had data available for one season. Thus, six players could be compared from one season to the next. In doing this, the analyst discovered that three of the six actually saw a decrease in their in-conference FGP between the 2019-20 season and the 2020-21 season. Looking at the table, they were Al Durham, Rob Phinisee, and Trayce Jackson-Davis. However, proceed with caution, as some players experienced an injury in the latter season.

Now, the same was evaluated for Coach Woodson.

aggregate(cbind(FGP) ~ season + player, data = playerMW_conf15_FG, FUN = mean, na.rm = TRUE)

##    season               player      FGP
## 1    2023  Jalen Hood-Schifino 40.54527
## 2    2022          Miller Kopp 36.17637
## 3    2023          Miller Kopp 50.71537
## 4    2022       Parker Stewart 33.09163
## 5    2022        Race Thompson 53.47147
## 6    2023        Race Thompson 47.94218
## 7    2022         Rob Phinisee 26.81490
## 8    2022          Tamar Bates 27.93891
## 9    2023          Tamar Bates 25.73517
## 10   2022 Trayce Jackson-Davis 57.65243
## 11   2023 Trayce Jackson-Davis 52.71895
## 12   2022        Trey Galloway 48.72222
## 13   2023        Trey Galloway 46.30357
## 14   2022       Xavier Johnson 38.46749
## 15   2023       Xavier Johnson 21.59091

Of the 9 players listed, three only had data available for one season. Thus, six players could be compared from one season to the next. In doing this, the analyst discovered that five of the six actually saw a decrease in their in-conference FGP between the 2021-22 season and the 2022-23 season. Looking at the table, they were Race Thompson, Tamar Bates, Trayce Jackson-Davis, Trey Galloway, and Xavier Johnson. Only Miller Kopp improved. However, proceed with caution, as some players experienced an injury in the latter season. While these findings may not be directly due to coaching, it is one aspect to consider when evaluating the coach.
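The season-to-season comparison can also be computed instead of read off the table. The sketch below re-enters the Woodson FGP table printed above and uses `merge()` to keep only players with both seasons; `fgp`, `both`, and `declined` are hypothetical names.

```r
# Woodson FGP by season and player, re-entered from the table above
fgp <- data.frame(
  season = c(2023, 2022, 2023, 2022, 2022, 2023, 2022, 2022, 2023,
             2022, 2023, 2022, 2023, 2022, 2023),
  player = c("Jalen Hood-Schifino", "Miller Kopp", "Miller Kopp",
             "Parker Stewart", "Race Thompson", "Race Thompson",
             "Rob Phinisee", "Tamar Bates", "Tamar Bates",
             "Trayce Jackson-Davis", "Trayce Jackson-Davis",
             "Trey Galloway", "Trey Galloway",
             "Xavier Johnson", "Xavier Johnson"),
  FGP = c(40.54527, 36.17637, 50.71537, 33.09163, 53.47147, 47.94218,
          26.81490, 27.93891, 25.73517, 57.65243, 52.71895, 48.72222,
          46.30357, 38.46749, 21.59091)
)

first <- fgp[fgp$season == 2022, c("player", "FGP")]
second <- fgp[fgp$season == 2023, c("player", "FGP")]

# merge() keeps only players who appear in both seasons
both <- merge(first, second, by = "player", suffixes = c("_first", "_second"))
both$change <- both$FGP_second - both$FGP_first

# Largest declines first
both[order(both$change), ]

declined <- both$player[both$change < 0]
declined
```

The same pattern applies to the Miller table by swapping in seasons 2020 and 2021.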

ThreePTP

Since some games may have 0 posted attempts for three pointers, the analyst has to look at each shooting percentage by itself to ensure that no undefined values are derived. First start by creating a data set with just season, player, position, ThreePTM and ThreePTA for Archie Miller.

columns <- c(1:3, 7:8)
playerAM_conf15_3P <- playerAM_conf15[,columns]
playerAM_conf15_3P <- filter(playerAM_conf15_3P, ThreePTA != 0)

Now, create the new variable ThreePTP with the following code:

playerAM_conf15_3P$ThreePTP <- (playerAM_conf15_3P$ThreePTM / playerAM_conf15_3P$ThreePTA) * 100

Now, do the same for Mike Woodson. Starting by filtering the data accordingly.

columns <- c(1:3, 7:8)
playerMW_conf15_3P <- playerMW_conf15[,columns]
playerMW_conf15_3P <- filter(playerMW_conf15_3P, ThreePTA != 0)

Then, generate the new variable:

playerMW_conf15_3P$ThreePTP <- (playerMW_conf15_3P$ThreePTM / playerMW_conf15_3P$ThreePTA) * 100

Now, aggregate the data in a couple different ways. First find the mean of each position by coach. The table below shows information for Coach Miller.

aggregate(cbind(ThreePTP) ~ position, data = playerAM_conf15_3P, FUN = mean, na.rm = TRUE)

##   position ThreePTP
## 1        F 27.03125
## 2        G 29.04295

Then, the same was done for Coach Woodson.

aggregate(cbind(ThreePTP) ~ position, data = playerMW_conf15_3P, FUN = mean, na.rm = TRUE)

##   position ThreePTP
## 1        F 35.75964
## 2        G 29.84594

As shown above, Woodson’s average three-point percentage is higher than Miller’s at both positions. In fact, forwards were much more effective under Woodson, shooting nearly 9 percentage points better from three (35.8% versus 27.0%). The analyst then took this a step further by looking at position by season.

aggregate(cbind(ThreePTP) ~ season + position, data = playerAM_conf15_3P, FUN = mean, na.rm = TRUE)

##   season position ThreePTP
## 1   2020        F 27.47748
## 2   2021        F 26.41975
## 3   2020        G 29.75953
## 4   2021        G 28.25036

Both positions for Miller saw a decrease over the two year span. Now, see if the same holds true for Woodson.

aggregate(cbind(ThreePTP) ~ season + position, data = playerMW_conf15_3P, FUN = mean, na.rm = TRUE)

##   season position ThreePTP
## 1   2022        F 34.30786
## 2   2023        F 38.07172
## 3   2022        G 28.06368
## 4   2023        G 32.63702

Conversely, both positions for Woodson saw growth over his two years as coach. It is clear to see that Woodson has helped the team be better three point shooters in his first two seasons. Next, the analyst wanted to look into individual player growth. First, look at Archie Miller’s player development.

aggregate(cbind(ThreePTP) ~ season + player, data = playerAM_conf15_3P, FUN = mean, na.rm = TRUE)

##    season          player ThreePTP
## 1    2020       Al Durham 37.10526
## 2    2021       Al Durham 40.07143
## 3    2020 Armaan Franklin 19.33333
## 4    2021 Armaan Franklin 36.90476
## 5    2020   Devonte Green 32.03896
## 6    2020   Jerome Hunter 30.20833
## 7    2021   Jerome Hunter 33.13725
## 8    2020    Justin Smith 22.22222
## 9    2020   Race Thompson 33.33333
## 10   2021   Race Thompson 15.00000
## 11   2020    Rob Phinisee 28.24561
## 12   2021    Rob Phinisee 18.23308
## 13   2021   Trey Galloway 15.38462

Of the 8 players listed, three only had data available for one season. Thus, five players could be compared from one season to the next. In doing this, the analyst discovered that two of the five actually saw a decrease in their in-conference ThreePTP between the 2019-20 season and the 2020-21 season. Looking at the table, they were Race Thompson and Rob Phinisee. However, proceed with caution, as some players experienced an injury in the latter season.

Now, the same was evaluated for Coach Woodson.

aggregate(cbind(ThreePTP) ~ season + player, data = playerMW_conf15_3P, FUN = mean, na.rm = TRUE)

##    season               player ThreePTP
## 1    2023  Jalen Hood-Schifino 27.48599
## 2    2022          Miller Kopp 35.99567
## 3    2023          Miller Kopp 43.89683
## 4    2022       Parker Stewart 31.73521
## 5    2022        Race Thompson 35.96491
## 6    2023        Race Thompson 21.42857
## 7    2022         Rob Phinisee 17.97052
## 8    2022          Tamar Bates 27.85714
## 9    2023          Tamar Bates 28.80208
## 10   2022 Trayce Jackson-Davis  0.00000
## 11   2022        Trey Galloway 25.00000
## 12   2023        Trey Galloway 43.14815
## 13   2022       Xavier Johnson 32.34848
## 14   2023       Xavier Johnson 12.50000

Of the 9 players listed, four only had data available for one season. Thus, five players could be compared from one season to the next. In doing this, the analyst discovered that two of the five actually saw a decrease in their in-conference ThreePTP between the 2021-22 season and the 2022-23 season. Looking at the table, they were Race Thompson and Xavier Johnson. However, proceed with caution, as some players experienced an injury in the latter season. While these findings may not be directly due to coaching, it is one aspect to consider when evaluating the coach.

PTS

Since points do not have the ability to be undefined, the analyst just needed to find the averages in a variety of ways. First find the mean of each position by coach. The table below shows information for Coach Miller.

aggregate(cbind(PTS) ~ position, data = playerAM_conf15, FUN = mean, na.rm = TRUE)

##   position      PTS
## 1        F 8.889610
## 2        G 7.585526

Then, the same was done for Coach Woodson.

aggregate(cbind(PTS) ~ position, data = playerMW_conf15, FUN = mean, na.rm = TRUE)

##   position       PTS
## 1        F 11.854839
## 2        G  7.576923

As shown above, Woodson’s forwards averaged more points per conference game (about 11.9) than Miller’s (about 8.9). However, the guards under both coaches contributed about the same number of points per conference game (roughly 7.6). Next, the analyst took this a step further by looking at position by season.
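The two per-coach tables can also be lined up side by side. The following is a minimal sketch and not part of the original analysis: it re-enters the position means printed above by hand (the names `pts_am`, `pts_mw`, and `pts_compare` are hypothetical) and uses base R's `merge()` to compute the difference between coaches.

```r
# Position means copied by hand from the two aggregate() outputs above.
# pts_am and pts_mw are hypothetical names used only for this sketch.
pts_am <- data.frame(position = c("F", "G"), PTS = c(8.889610, 7.585526))
pts_mw <- data.frame(position = c("F", "G"), PTS = c(11.854839, 7.576923))

# Join on position; the suffixes label which coach each column belongs to.
pts_compare <- merge(pts_am, pts_mw, by = "position",
                     suffixes = c("_Miller", "_Woodson"))

# Positive values mean Woodson's players averaged more points.
pts_compare$diff <- pts_compare$PTS_Woodson - pts_compare$PTS_Miller
print(pts_compare)
```

In the full analysis, the two hand-entered data frames would presumably be replaced by the `aggregate()` results themselves.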

aggregate(cbind(PTS) ~ season + position, data = playerAM_conf15, FUN = mean, na.rm = TRUE)

##   season position       PTS
## 1   2020        F  7.291667
## 2   2021        F 11.534483
## 3   2020        G  7.037975
## 4   2021        G  8.178082

Both positions for Miller saw an increase over the two-year span. Now, see if the same holds true for Woodson.

aggregate(cbind(PTS) ~ season + position, data = playerMW_conf15, FUN = mean, na.rm = TRUE)

##   season position      PTS
## 1   2022        F 11.36232
## 2   2023        F 12.47273
## 3   2022        G  6.84375
## 4   2023        G  8.75000

Similarly, both positions for Woodson saw growth over his two years as coach. Next, the analyst wanted to look into individual player growth. First, look at Archie Miller’s player development.

aggregate(cbind(PTS) ~ season + player, data = playerAM_conf15, FUN = mean, na.rm = TRUE)

##    season               player       PTS
## 1    2020            Al Durham  9.200000
## 2    2021            Al Durham 11.700000
## 3    2020      Armaan Franklin  2.050000
## 4    2021      Armaan Franklin 11.133333
## 5    2020        Devonte Green  9.550000
## 6    2020        Jerome Hunter  3.736842
## 7    2021        Jerome Hunter  7.000000
## 8    2020           Joey Brunk  6.100000
## 9    2020         Justin Smith  9.500000
## 10   2020        Race Thompson  3.588235
## 11   2021        Race Thompson  8.700000
## 12   2020         Rob Phinisee  7.368421
## 13   2021         Rob Phinisee  7.050000
## 14   2020 Trayce Jackson-Davis 12.800000
## 15   2021 Trayce Jackson-Davis 18.450000
## 16   2021        Trey Galloway  3.055556

Of the 10 players listed, four had data available for only one season. Thus, six players could be compared from one season to the next. In doing this, the analyst discovered that only one player of the six saw a decrease in in-conference PTS between the 2019-20 season and the 2020-21 season: Rob Phinisee. However, these results should be interpreted with caution, as some players experienced an injury in the latter season.

Now, the same was evaluated for Coach Woodson.

aggregate(cbind(PTS) ~ season + player, data = playerMW_conf15, FUN = mean, na.rm = TRUE)

##    season               player       PTS
## 1    2023  Jalen Hood-Schifino 14.444444
## 2    2022          Miller Kopp  5.304348
## 3    2023          Miller Kopp  7.700000
## 4    2022       Parker Stewart  5.863636
## 5    2022        Race Thompson 11.826087
## 6    2023        Race Thompson  6.933333
## 7    2022         Rob Phinisee  5.000000
## 8    2022          Tamar Bates  3.000000
## 9    2023          Tamar Bates  4.800000
## 10   2022 Trayce Jackson-Davis 16.956522
## 11   2023 Trayce Jackson-Davis 21.400000
## 12   2022        Trey Galloway  6.400000
## 13   2023        Trey Galloway  7.850000
## 14   2022       Xavier Johnson 13.136364
## 15   2023       Xavier Johnson  6.000000

Of the 9 players listed, three had data available for only one season. Thus, six players could be compared from one season to the next. In doing this, the analyst discovered that two of the six saw a decrease in their in-conference PTS between the 2021-22 season and the 2022-23 season: Race Thompson and Xavier Johnson. However, these results should be interpreted with caution, as some players experienced an injury in the latter season. While these findings may not be directly due to coaching, they are one aspect to consider when evaluating the coach.
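The season-to-season comparison described above was read off the table by eye; it could also be computed. The sketch below is an illustration under stated assumptions, not the analyst's code: it re-enters the PTS means of the six returning players from the table above (the names `pts_mw_player`, `wide`, and `decreased` are hypothetical) and uses base R's `reshape()` to put the two seasons in separate columns.

```r
# Per-player PTS means for the six returning players, copied from the table above.
# In the full analysis this would be the aggregate() result itself.
pts_mw_player <- data.frame(
  season = rep(c(2022, 2023), times = 6),
  player = rep(c("Miller Kopp", "Race Thompson", "Tamar Bates",
                 "Trayce Jackson-Davis", "Trey Galloway", "Xavier Johnson"),
               each = 2),
  PTS = c(5.304348, 7.700000, 11.826087, 6.933333, 3.000000, 4.800000,
          16.956522, 21.400000, 6.400000, 7.850000, 13.136364, 6.000000)
)

# One row per player, one column per season (PTS.2022 and PTS.2023).
wide <- reshape(pts_mw_player, idvar = "player", timevar = "season",
                direction = "wide")
wide$change <- wide$PTS.2023 - wide$PTS.2022

# which() skips NA changes, which would occur for one-season players
# if the full aggregate() output were used instead of this subset.
decreased <- as.character(wide$player[which(wide$change < 0)])
print(decreased)
```

The same pattern should carry over to AST, REB, and ThreePTP by swapping the value column.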

AST

Because assists cannot be undefined, the analyst only needed to compute the averages in a variety of ways. First, find the mean for each position by coach. The table below shows information for Coach Miller.

aggregate(cbind(AST) ~ position, data = playerAM_conf15, FUN = mean, na.rm = TRUE)

##   position       AST
## 1        F 0.8766234
## 2        G 2.1052632

Then, the same was done for Coach Woodson.

aggregate(cbind(AST) ~ position, data = playerMW_conf15, FUN = mean, na.rm = TRUE)

##   position      AST
## 1        F 1.774194
## 2        G 2.134615

As shown above, Woodson’s forwards had a higher average in assists than Miller’s. However, the guards between both coaches appear to contribute about the same number of assists per conference game. Now, the analyst looked to take this a step further by looking at position by season.

aggregate(cbind(AST) ~ season + position, data = playerAM_conf15, FUN = mean, na.rm = TRUE)

##   season position       AST
## 1   2020        F 0.7083333
## 2   2021        F 1.1551724
## 3   2020        G 1.9746835
## 4   2021        G 2.2465753

Both positions for Miller saw an increase over the two-year span. Now, see if the same holds true for Woodson.

aggregate(cbind(AST) ~ season + position, data = playerMW_conf15, FUN = mean, na.rm = TRUE)

##   season position      AST
## 1   2022        F 1.376812
## 2   2023        F 2.272727
## 3   2022        G 2.177083
## 4   2023        G 2.066667

Similarly, forwards for Woodson saw growth over his two years as coach. However, the guards saw a slight decrease in the number of assists during conference games. Next, the analyst wanted to look into individual player growth. First, look at Archie Miller’s player development.

aggregate(cbind(AST) ~ season + player, data = playerAM_conf15, FUN = mean, na.rm = TRUE)

##    season               player       AST
## 1    2020            Al Durham 2.1500000
## 2    2021            Al Durham 2.4000000
## 3    2020      Armaan Franklin 0.7000000
## 4    2021      Armaan Franklin 1.8666667
## 5    2020        Devonte Green 1.8000000
## 6    2020        Jerome Hunter 0.4210526
## 7    2021        Jerome Hunter 0.6111111
## 8    2020           Joey Brunk 0.4000000
## 9    2020         Justin Smith 1.0000000
## 10   2020        Race Thompson 0.2941176
## 11   2021        Race Thompson 1.3000000
## 12   2020         Rob Phinisee 3.3157895
## 13   2021         Rob Phinisee 3.0500000
## 14   2020 Trayce Jackson-Davis 1.3500000
## 15   2021 Trayce Jackson-Davis 1.5000000
## 16   2021        Trey Galloway 1.5000000

Of the 10 players listed, four had data available for only one season. Thus, six players could be compared from one season to the next. In doing this, the analyst discovered that only one player of the six saw a decrease in in-conference AST between the 2019-20 season and the 2020-21 season: Rob Phinisee. However, these results should be interpreted with caution, as some players experienced an injury in the latter season.

Now, the same was evaluated for Coach Woodson.

aggregate(cbind(AST) ~ season + player, data = playerMW_conf15, FUN = mean, na.rm = TRUE)

##    season               player       AST
## 1    2023  Jalen Hood-Schifino 3.4444444
## 2    2022          Miller Kopp 1.0434783
## 3    2023          Miller Kopp 1.1000000
## 4    2022       Parker Stewart 1.1363636
## 5    2022        Race Thompson 1.2173913
## 6    2023        Race Thompson 0.8000000
## 7    2022         Rob Phinisee 1.6250000
## 8    2022          Tamar Bates 0.4761905
## 9    2023          Tamar Bates 0.9000000
## 10   2022 Trayce Jackson-Davis 1.8695652
## 11   2023 Trayce Jackson-Davis 4.5500000
## 12   2022        Trey Galloway 2.1333333
## 13   2023        Trey Galloway 1.8000000
## 14   2022       Xavier Johnson 5.2727273
## 15   2023       Xavier Johnson 4.0000000

Of the 9 players listed, three had data available for only one season. Thus, six players could be compared from one season to the next. In doing this, the analyst discovered that three of the six saw a decrease in their in-conference AST between the 2021-22 season and the 2022-23 season: Race Thompson, Trey Galloway, and Xavier Johnson. However, these results should be interpreted with caution, as some players experienced an injury in the latter season. While these findings may not be directly due to coaching, they are one aspect to consider when evaluating the coach.

REB

Because rebounds cannot be undefined, the analyst only needed to compute the averages in a variety of ways. First, find the mean for each position by coach. The table below shows information for Coach Miller.

aggregate(cbind(REB) ~ position, data = playerAM_conf15, FUN = mean, na.rm = TRUE)

##   position      REB
## 1        F 5.318182
## 2        G 2.342105

Then, the same was done for Coach Woodson.

aggregate(cbind(REB) ~ position, data = playerMW_conf15, FUN = mean, na.rm = TRUE)

##   position      REB
## 1        F 6.241935
## 2        G 2.358974

As shown above, Woodson’s forwards had a higher average in rebounds than Miller’s. However, the guards between both coaches appear to contribute about the same number of rebounds per conference game. Now, the analyst looked to take this a step further by looking at position by season.

aggregate(cbind(REB) ~ season + position, data = playerAM_conf15, FUN = mean, na.rm = TRUE)

##   season position      REB
## 1   2020        F 4.802083
## 2   2021        F 6.172414
## 3   2020        G 2.126582
## 4   2021        G 2.575342

Both positions for Miller saw an increase over the two-year span. Now, see if the same holds true for Woodson.

aggregate(cbind(REB) ~ season + position, data = playerMW_conf15, FUN = mean, na.rm = TRUE)

##   season position      REB
## 1   2022        F 6.000000
## 2   2023        F 6.545455
## 3   2022        G 2.281250
## 4   2023        G 2.483333

Similarly for Woodson, both positions also saw growth in the number of rebounds in conference games over his two years as coach. Next, the analyst wanted to look into individual player growth. First, look at Archie Miller’s player development.

aggregate(cbind(REB) ~ season + player, data = playerAM_conf15, FUN = mean, na.rm = TRUE)

##    season               player      REB
## 1    2020            Al Durham 2.000000
## 2    2021            Al Durham 2.600000
## 3    2020      Armaan Franklin 1.000000
## 4    2021      Armaan Franklin 3.666667
## 5    2020        Devonte Green 2.950000
## 6    2020        Jerome Hunter 2.052632
## 7    2021        Jerome Hunter 3.111111
## 8    2020           Joey Brunk 4.800000
## 9    2020         Justin Smith 5.000000
## 10   2020        Race Thompson 4.058824
## 11   2021        Race Thompson 5.950000
## 12   2020         Rob Phinisee 2.578947
## 13   2021         Rob Phinisee 2.500000
## 14   2020 Trayce Jackson-Davis 7.850000
## 15   2021 Trayce Jackson-Davis 9.150000
## 16   2021        Trey Galloway 1.722222

Of the 10 players listed, four had data available for only one season. Thus, six players could be compared from one season to the next. In doing this, the analyst discovered that only one player of the six saw a decrease in in-conference REB between the 2019-20 season and the 2020-21 season: Rob Phinisee. However, these results should be interpreted with caution, as some players experienced an injury in the latter season.

Now, the same was evaluated for Coach Woodson.

aggregate(cbind(REB) ~ season + player, data = playerMW_conf15, FUN = mean, na.rm = TRUE)

##    season               player       REB
## 1    2023  Jalen Hood-Schifino  3.777778
## 2    2022          Miller Kopp  2.391304
## 3    2023          Miller Kopp  2.800000
## 4    2022       Parker Stewart  2.181818
## 5    2022        Race Thompson  7.739130
## 6    2023        Race Thompson  4.000000
## 7    2022         Rob Phinisee  2.062500
## 8    2022          Tamar Bates  1.095238
## 9    2023          Tamar Bates  1.450000
## 10   2022 Trayce Jackson-Davis  7.869565
## 11   2023 Trayce Jackson-Davis 12.200000
## 12   2022        Trey Galloway  1.733333
## 13   2023        Trey Galloway  2.500000
## 14   2022       Xavier Johnson  4.045455
## 15   2023       Xavier Johnson  1.000000

Of the 9 players listed, three had data available for only one season. Thus, six players could be compared from one season to the next. In doing this, the analyst discovered that two of the six saw a decrease in their in-conference REB between the 2021-22 season and the 2022-23 season: Race Thompson and Xavier Johnson. However, these results should be interpreted with caution, as some players experienced an injury in the latter season. While these findings may not be directly due to coaching, they are one aspect to consider when evaluating the coach.

Overall Conclusions

In most of the categories explored, Woodson’s averages exceeded Miller’s. Even more important, from one year to the next, the analyst saw improvement for Woodson and his team in most of the same categories. Considering the team’s past four seasons, especially in conference play, improvement is necessary for the Hoosiers to return to overall success. In terms of individual player development, conclusions are harder to come by. Both coaches typically saw improvement over the course of a year in each category among their players. However, because players also gain experience and face the risk of injury, it is hard to say that the coach is the underlying factor in that improvement.

Insights Gained

Recruiting Insights

In the past 4 years both coaches recruited similarly. In fact, both coaches:

Kept the roster size at 17 players the past season.
Do not tend to recruit centers, placing a larger emphasis on recruiting forwards and guards.
Recruited many players out of Indiana and neighboring states, although the data pointed to recruitment all across the country.
Use height as a factor in recruitment, since height is an advantage in the game of basketball.
- Centers on average were the tallest at approximately 6’10”.
- Guards on average were the smallest at approximately 6’4”.

The only difference in the coaches’ recruitment came from the average height of guards, which increased by an inch (from 6’6” to 6’7”) when Mike Woodson replaced Archie Miller.

Coach Performance Insights

Overall Win/Loss Insights

Archie Miller’s record was found to be 32-27, a winning percentage of 54.2%. Mike Woodson has improved the Hoosiers’ record to 42-24 over the last two seasons, a winning percentage of 63.6%. That is an increase of nearly 10 percentage points (9.4).
Archie Miller’s first season in this study was better than Mike Woodson’s first season. However, the opposite was true of the successive year. Miller’s winning percentage fell by 18.1 points, while Woodson’s rose by 7.7 points.
Woodson posted the highest single-season winning percentage, 67.7% in the 2022-23 season, which was 7.7 points higher than the year prior.
In general, the Hoosiers have seen improvement under the guidance of Mike Woodson regardless of game location. They are being effective and finding more ways to win on the road.
- Both coaches have a better record at home than at any other location.
  - Coach Woodson has more wins and fewer losses than Archie Miller when playing at home (over a span of 2 seasons each).
- Both coaches struggle to win away games.
  - Woodson and Miller have an identical number of away losses; however, Woodson’s number of away wins in his two seasons surpasses Miller’s number in his last two seasons.
Both coaches hold positive score differentials, which means that on average their teams are outscoring their opponents.
- Mike Woodson’s teams outperform Archie Miller’s by about 3 in average score differential.
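The winning percentages quoted above follow directly from the win-loss records. As a quick check, and only as an illustrative sketch (the data frame `records` and its column names are hypothetical, not objects from the original analysis), the arithmetic can be reproduced in R:

```r
# Win-loss records as reported in the insights above.
# `records` is a hypothetical data frame built only for this check.
records <- data.frame(coach  = c("Miller", "Woodson"),
                      wins   = c(32, 42),
                      losses = c(27, 24))

# Winning percentage = wins / games played, expressed in percent.
records$win_pct <- round(100 * records$wins / (records$wins + records$losses), 1)

# Gap between the two coaches, in percentage points.
gap <- diff(records$win_pct)

print(records)
print(gap)
```

The gap is a difference between two percentages, so it is most naturally read in percentage points rather than as a relative change.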

Conference Win/Loss Insights

Miller had a losing record against 8 of the 13 conference teams, including 5 teams that remained unbeaten against the Hoosiers in his last two seasons as head coach.
Woodson posted a losing record against 6 of the 13 teams, including just two teams that remain unbeaten against the Hoosiers.
Miller had a winning record against 5 of the 13 conference teams.
Coach Woodson had a winning record against 7 of the 13 conference teams.
Archie Miller’s teams had a negative average score differential against 9/13 teams in the conference and an average score differential of 0 against 1 team.
Mike Woodson’s teams had a negative average score differential against just 6/13 teams.
Archie Miller’s teams had a positive average score differential against just 3/13 teams, which are all teams that he posted a perfect record against.
Mike Woodson posted a positive differential against 7/13 teams.

In-Conference Team Performance Insights

FGP: Mike Woodson’s two-year average is better than Archie Miller’s by more than 3%.
- Mike Woodson posted better numbers than Archie Miller in every season. Further, Mike Woodson’s teams improved from one season to the next, going from 44% in 2021-22 to 46.9% in 2022-23.
STL: Archie Miller’s teams posted a better average for steals than Mike Woodson’s.
- Archie Miller increased the average number of steals in his two-year span. On the other hand, Mike Woodson saw a decrease in steals during his second year as coach.
FTP: Mike Woodson’s teams’ free throw percentages are above those of Archie Miller’s teams by an average of about 5%.
- Mike Woodson saw an increase in free throw percentage over his two seasons as head coach.
Both coaches led the team to around the same best and worst average field goal percentages by month, but the timing of each is different.
- Miller’s teams in his last two seasons saw an overall decrease in field goal percentage over the course of the season, while Woodson’s teams generally saw an increase.
Against 10/13 conference opponents, the Hoosiers posted better team averages in more game components (OREB, DREB, AST, STL, TO, FGP, and ThreePTP) in Woodson’s first two years than in Archie Miller’s last two years.
- The Hoosiers under Woodson outranked the Hoosiers under Miller in average field goal percentage against 11/13 of their conference opponents.
- Between Woodson’s first and second years as head coach, field goal percentage and three-point percentage each improved against 8/13 of their conference opponents.
Using per-position analysis, it is clear that in most of the categories explored Woodson’s teams averaged more than Miller’s teams.

For individual player development, conclusions were harder to come by. Both coaches typically saw improvement over the course of a year in each category among their players. However, because players also gain experience and face the risk of injury, it is hard to say that the coach is the underlying factor in that improvement.

Project Limitations

While the analysis is thorough, the study has shortcomings and limitations. A few have been identified and included below:

For example, more could have been learned if additional data had been provided. Some ideas for this are included in the Next Steps tab, but one example is information on all of the years that Archie Miller was head coach. This would give a more complete picture of his entire coaching career at IU as opposed to the team’s performance in his last two seasons. In fact, it may show that his coaching career at IU was more successful than the data set used here suggests.
In addition, it is important to note that most of the analysis was done using descriptive analytics. While this was effective in identifying what had happened for the Hoosiers in the past 4 seasons, it does not allow the analyst to project what could happen in the future. Also, because the study relies primarily on descriptive analytics, its findings cannot be generalized to a different team or a different set of coaches. In other words, the code could be reused, but the analysis would have to be redone.
Lastly, this project had to follow certain time constraints. Because of them, as with any study, the analysis is not as thorough as it could have been. If more time had been available, the data could have been examined in different ways, and the inclusion of the game logs could have elevated the study to the next level.

Regardless of these limitations, this project provides a good foundation for generating a general recruiting profile and evaluating a team’s performance under a given head coach.

Next Steps

To ensure continued growth and/or success for the program, the team’s performance under the current head coach should be evaluated regularly. One way to do this is to monitor performance rates over the course of each season and report the findings annually. As the team gets better, it is unlikely to show as much improvement as was present in this study; thus, continued research could include but is not limited to:

A larger focus on the team’s strengths or weaknesses as opposed to a comparison of averages by year. Identifying these areas can help the coaches play to their team’s strengths by providing them with data-driven strategies and game plans. In addition, identifying weaknesses can help the coach focus on those areas of play when preparing for the next season so that they are not such a detriment to the team.
Inclusion of the game logs to derive useful in-game trends. This could allow the coaches to see which strategies were effective in the game and which strategies did not provide evidence of working. This could also provide better insights into individual performances in specific moments of the game. Note that this analysis could also be done for the team’s opponents, so that valuable scouting information would be available to improve overall team performance.
Using NCAA DI Men’s basketball data to find benchmarks that denote a successful team. This could be done by aggregating season averages for teams that made the NCAA tournament in the past season(s) and then comparing these results to those of the team’s most recent or current season. This method is beneficial because it provides context for how the Hoosiers did in comparison to other (successful) teams, as opposed to a comparison with last year’s team. It also gives the coaches an expectation of exactly how much improvement needs to take place to take their team to the next level.

Indiana Basketball

Kamrie Foster

2023-05-01

Indiana Men’s Basketball Coach Analysis

Report Summary

Summary of the Problem Statement

Summary of the Data and Methodology

Athletic Director Implications

Business Background

General Information and History

Business Problem

Key Stakeholders

Analytical Approach

Expected Benefits

Data Preparation

Data Understanding

Indiana Rosters

Indiana Schedules

Indiana Box Scores

Data Cleaning

Indiana Rosters

Indiana Schedules

Indiana Box Scores

Libraries Needed for Analysis

Roster Analysis

Roster Size

Players by Position

Geographical Information

Player Height

Schedule Analysis

Overall Win/Losses

Win/Loss Record by Season

Overall Win/Losses by Location

Win/Losses by Location

Seasonal Win/Losses by Location

Score Average Comparison

Score Differential Time Series Information

Specific Opponent Information

Conference Game Outcomes

Conference Score Differential Time Series

Conference Score Differential by Opponent

Box Score Analysis

Important Variables for Game Outcome by Team

Important Variables for Game Outcome

Manipulating the Data Before Model Creation

Filtering Box Score Data to Include Team Information

Logistic Regression

Classification Trees

Random Forest

Important Variable Conclusion