The Indiana University Athletics Department previously commissioned a study to evaluate the performance of the men’s basketball program. Researchers found a gap between the success the program has posted in the past and its performance in the most recent years. Given the program’s successful history, the limited number of NCAA tournament appearances and the lack of conference championships are reason enough to question the direction of the program. The athletic department therefore feels strongly that the program needs to improve quickly. Thus, the athletic director for Indiana University has recommended a thorough data-driven evaluation of the team’s performance over the past four seasons in relation to the coaching change. The athletic director wants to know whether performance has improved, as an indication of whether the new coach is the right fit for the program. This evaluation covers the current head coach, Mike Woodson, as well as the previous head coach, Archie Miller. Woodson took over in March of 2021 from Miller, who had been head coach since March of 2017. While Archie Miller’s tenure at Indiana spans the 2017-18 season through the 2020-21 season, this study only includes Miller’s last two seasons as head coach. Given this, the aim of this study is twofold:
The research was based on public NCAA Hoop data (supplied by the NCAA Hoop Repository) spanning from the 2019-20 season to the most recent 2022-23 season. The roster data and the schedule data were each provided by year. The yearly roster files were combined into one large data file containing all players on the roster over the four seasons being analyzed. The same was done for the schedules, combining each year to create a list of all games scheduled over that same four-season period. The box score data, however, was provided by game. Thus, each game had to be imported and combined within its season, and the seasons then combined, to create the box scores for all games in the past four seasons. Apart from needing to be accumulated in one place, the data was well structured. Thus, data cleaning was not overly extensive, and the larger focus was on data manipulation throughout the study. Each manipulation is accompanied by a general description and explanation so that there is a general understanding of what took place.
This study was largely based on descriptive analytics (using data consolidation, bar graphs, tables, and line graphs) to reach many of the conclusions discussed below. Predictive analytics was also used, although to a limited extent. This was done to identify variables that are important to game outcome, as wins are one metric used to define overall success. Accordingly, the analyst aimed to find the key components of the game (using logistic regression, classification trees, and random forests) that had the most influence on game outcome. From these results the analyst then compared the identified components across the different coaches as well as by year to see if improvement had taken place.
To conclude, analysis indicates that the Indiana University Men’s Basketball team has seen a considerable amount of improvement since the coaching change and even within the past two seasons (with Mike Woodson as head coach). The improvement is easily seen in the number of wins and the winning percentages posted over the course of the entire season, as well as within conference competition. As stated before, wins are one mark of success in sports; however, notable improvement can also be seen in several areas of play for the Hoosiers (both between coaches and between successive seasons). Thus, the evaluation supports asking Woodson to return next season as head coach. The evidence suggests that he could lead the program to more success next season.
The analysis also indicates that the recruiting styles of the coaches are nearly identical. Both coaches held the roster to the same size. In addition, there appears to be a larger number of opportunities available for both forwards and guards with respective heights of at least 6’4” and 6’6”. In addition, neither coach appears to have restrictions on recruiting players from any part of the country. Thus, any DI caliber athlete should have an opportunity to play in Bloomington.
The Indiana University Athletics Department is looking to discover new information on the development of program performance for its men’s basketball team. The Hoosiers are an NCAA Division I college basketball program out of Bloomington, Indiana and a member of the Big Ten Conference. Notably, the men’s basketball program is among the most successful programs in NCAA history. The Hoosiers have won a total of five NCAA Championships, sitting only behind UCLA (11), Kentucky (8), and North Carolina (6), and tied with Duke (5). Among these five NCAA championships, the Hoosiers also hold the title of the last undefeated national championship team (in 1976). The historical success of the program does not end there, as an additional achievement for the Hoosiers is their NCAA tournament runner-up finish in 2002. Aside from national titles, the program has also amassed 22 Big Ten Conference Championships and 40 NCAA tournament appearances. The Hoosiers have also remained a relevant, high-performing basketball program, with 31 preseason appearances in the AP Poll and 28 appearances in the final AP Poll. These ranked appearances add up to an estimated 574 total weeks spent in the AP Poll. Although Indiana has a history as a successful program, in recent seasons the team has struggled, and there are overarching concerns that the program has taken steps back since its last national title run in 2002. Thus, with the hiring of new head coach Mike Woodson (as of March 28, 2021), the Indiana University athletic director has hired a sports analysis team to evaluate the team’s performance under the new coach and ensure this coaching change is leading the program in the right direction.
Indiana Athletics Mission Statement
“The spirit of Indiana in athletics must be the spirit of the team. The team must be competitive in spirit and have the will to win over and above the will to star… Without the spirit of the team and this goal of school above self, we fail miserably—not only here in our sports life, but in the world of business and society after we leave this campus… With it, we exemplify the true spirit of Indiana athletics.”
In any sport, the coaching staff is largely held responsible for the program’s performance. As stated above, Indiana basketball has a strong history of successful teams. However, the team’s performance in recent years has not lived up to the high standards set in the past. The team has had notable accomplishments in the last 5 years, but the low points and stretches of mediocrity seem to outweigh most of the positives. Thus, with the increased interest in and use of analytics in sports, the athletic director has added staff to evaluate the overall performance of the team under the two different head coaches (Archie Miller and Mike Woodson) within the past four seasons. The additional staff members will look for patterns in play over the course of the seasons (in relation to each coach) to see whether the program is improving as expected. One of the two coaches being looked at is Archie Miller, who was named the 29th head coach of the Indiana Men’s Basketball program on March 25, 2017. He served as head coach for the Hoosiers until March 28, 2021, when the head coach in question, Mike Woodson, took over leading the Hoosiers. This analysis will look at the performance patterns at the tail end of Miller’s tenure and compare them to the performance patterns at the beginning of Woodson’s tenure at Indiana. Because the Hoosiers have struggled so much recently, the athletic department does not want to waste any time and needs to see the program improve. That is why this analysis is taking place: so that a coaching change can occur as quickly as possible if one is needed.
All of the data for this project will be sourced from a public database known as the NCAA Hoop Dataset.
Ideally, results from this study will:
By looking at performance, the athletic director can make a well-informed decision on whether the new head coach appears to be the right fit for this program. Additionally, from this analysis the coach will be able to take active and controllable steps to self-improvements within his team by learning about their tendencies and shortcomings.
With all of the business background components in mind, a list of key stakeholders has been generated below:
Note that the stakeholders are not limited to those listed above, but these will be the people/organizations/businesses most directly impacted.
With the data available, the research and development team plans to look at the following data tables: Box Score Reports, Team Schedule, and Team Roster. For more information on these tables, see the data preparation phase. The majority of the analysis will delve into the Hoosiers’ box scores and schedule, while supporting information will come from the roster. Most of the analysis will be done using descriptive analytics, which identifies what has already happened. The analyst will look at the schedule and historical box scores to gain insight into performance patterns of players and the team in two-year spans. The analyst will compare game-to-game performances of the team and also consolidate these statistics for each coach (over their respective seasons as head coach) across the last four years. The focus will be on 5-7 performance variables, identified by their importance in relation to game outcome. All variables included in this study will be explained in more detail in the data understanding phase. By narrowing the focus, the analyst will gain a more thorough and complete understanding of performance in the areas of the game deemed most important. While descriptive analytics is not the only type of analytics that could be used, it will be the most insightful for the discoveries the analyst needs to make.
Analysis should provide a foundation for the athletic director to confidently state whether the current coach is a good candidate to continue moving the program in the right direction. There is no guarantee that analysis will bring national or conference titles to the Hoosiers, as individual game outcomes can be highly unpredictable. However, general marks of improvement should provide evidence of a more competitive and well-rounded team or, better stated, a team that reflects the program’s historical attributes. The analyst looks to discover the strengths currently in the program and to identify weaknesses so that the coach can work on them until they are no longer a large detriment to the team. With the details of the program identified, the staff and players will have a greater, more objective understanding of themselves that will promote further competition and success. Other parts of the university will benefit from this too. The athletic department will benefit by continuing to put Indiana University on the map as a strong athletic program and drawing in potential recruits for basketball as well as other sports. The trickle-down effect of strong athletic programs will likely bring more students to campus due to the high interest in sports across the country. This will impact both the admissions office and the faculty/staff at Indiana University. This trickle-down effect would continue, as population growth in Bloomington will likely have a positive effect on the economy (more people visiting, more people spending, etc.). Finally, the NBA could benefit by having player performance information on potential draft picks from this university. Of course, if this proves to be successful at Indiana University, other colleges would want to be a part of it too.
Origins of the Data Set
This NCAA Men’s Hoops data can be found at the hyperlink here.
ncaahoopR is an R package for working with NCAA basketball play-by-play data. It automatically scrapes play-by-play data and returns it to the user in a tidier, more organized, and more concise format. This gives an analyst the capability to analyze the data in the ways that best suit the research being conducted. The data has no specific original purpose other than to be readily available for basketball analytics. The men’s college basketball data spans from the 2005-2006 season to the most current 2022-2023 season.
Description of the Data Set
Box Scores- Includes the game recorded box-scores from each game during the season. Includes information on individual performances as well as team totals for the game(s). Variables include player name, minutes played, field goal information, three-point shot information, free throw information, rebounds, assists, steals, blocks, turnovers, personal fouls, overall points, and whether the player was a starter or not. This table will likely be the “core” table for the analysis, as it hosts the easily and readily available data on the game.
Pbp_logs- Includes play-by-play information on each game recorded. This table includes the time of the play occurrence, description of the play, score, win probability throughout the game, time outs remaining, and information on who has possession of the ball.
Rosters- Team rosters for each season. Includes information on each player. These consist of position, class, height, weight, and hometown/home state.
Schedules- A team’s schedule for a given season. Displays information on the teams played, the dates they were played, as well as location, score, and the record of the team of interest (both regular and conference).
Note: These descriptions are geared more toward a general description. More specific details about the tables in terms of the analysis to be done will be discussed in the tabs below.
Data Understanding
The analysis team will only be using data sets from box_scores, rosters, and schedules for the Indiana basketball team. The data will be combined across four seasons spanning from 2019-2020 through the current 2022-2023 season. Information regarding each of the tables being used is included below. This also includes information on all of the variables present in the tables along with a description:
Indiana Rosters
Indiana Schedules
Indiana Box Scores
Data Cleaning
First all of the data sets were read into R.
library(readxl)
IU_Rosters <- read_excel("/Users/kamriefoster/Downloads/Indiana Rosters.xlsx")
IU_Schedule <- read_excel("/Users/kamriefoster/Downloads/Indiana Schedule.xlsx")
IU_BoxScores <- read_excel("/Users/kamriefoster/Downloads/Indiana Box Scores.xlsx")
Most of the data cleaning steps were done in Excel before the data was imported. Thus, these cleaning steps have been listed for each data table in the tabs below.
Note: there may be additional data manipulation when analysis begins. Those steps are not included here, as each was derived for a specific purpose that will be outlined where it occurs. The steps below were done to clean the data set as a whole and serve as the starting point for any manipulation that happens from this point forward.
Indiana Rosters
After these steps were done in Excel, a few data cleaning steps occurred in R. There were two steps of column removal. First, the column number was removed, as it only records the jersey number each player wore and provides no insight for this analysis. Then, the created variable player_id was removed. Although creating this variable only to remove it seems like extra work, the analyst did not know at the outset whether the athletic department would share the findings and want to uphold player confidentiality. If player confidentiality were the goal, player_id would be kept and the column containing the player’s name would be removed instead. These two removals are displayed below:
IU_Rosters <- IU_Rosters[,-2] #removal of number
IU_Rosters <- IU_Rosters[,-2] #removal of player_id
IU_Rosters
## # A tibble: 67 × 8
## season name position height weight class hometown homestate
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 2020 Cooper Bybee G 73 185 JR Ellettsvi… IN
## 2 2020 Al Durham G 76 185 JR Lilburn GA
## 3 2020 Armaan Franklin G 76 195 FR Indianapo… IN
## 4 2020 Justin Smith F 79 230 JR Buffalo G… IL
## 5 2020 Trayce Jackson-Davis F 81 245 FR Greenwood IN
## 6 2020 Michael Shipp G 75 185 FR Cincinnati OH
## 7 2020 Rob Phinisee G 73 190 SO Lafayette IN
## 8 2020 Devonte Green G 75 185 SR North Bab… NY
## 9 2020 Nathan Childress F 78 210 FR Zionsville IN
## 10 2020 Adrian Chapman G 74 190 SR Brownsburg IN
## # … with 57 more rows
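Because the two removals above are positional, the second `-2` only targets player_id because the first removal shifted every column one place to the left. A minimal sketch of a name-based alternative follows, assuming the columns are literally named `number`, `player_id`, and `name` as the prose and printed output suggest. The toy data frame, its jersey numbers and ids, and the helper `drop_cols` are hypothetical and used only for illustration.

```r
# Toy roster with the layout the text describes. Jersey numbers and ids are
# hypothetical; the two names are taken from the printed roster above.
roster <- data.frame(
  season    = c(2020, 2020),
  number    = c(1, 2),              # hypothetical values
  player_id = c("P001", "P002"),    # hypothetical values
  name      = c("Al Durham", "Rob Phinisee"),
  position  = c("G", "G"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop columns by name so the result does not depend
# on column order or on earlier removals.
drop_cols <- function(df, cols) {
  df[, !(names(df) %in% cols), drop = FALSE]
}

# Default for this analysis: keep names, drop jersey number and id.
roster_named <- drop_cols(roster, c("number", "player_id"))

# Confidential variant: keep the id and drop the name instead.
roster_confidential <- drop_cols(roster, c("number", "name"))

print(names(roster_named))
print(names(roster_confidential))
```

Switching between the two variants then only requires changing the vector of names, which matches the reconfiguration the text describes.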
Indiana Schedules
Indiana Box Scores
The analyst also created a few calculated fields:
More details and explanation of these variables and their creation will be included in the analysis portion they are created in. Note that these could not accurately be produced in R as there were undefined values that gave the analyst NA’s and produced inaccuracies.
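Undefined values of the kind mentioned here typically appear when a ratio has a zero denominator, since in R `0 / 0` evaluates to `NaN`. The sketch below shows one way such a field might be guarded. The calculated fields are not listed in this section, so field goal percentage and the column names `fgm` and `fga` are assumptions chosen purely for illustration.

```r
# Hypothetical per-player box score rows; fgm/fga are assumed column names.
box <- data.frame(
  fgm = c(5, 0, 3),
  fga = c(10, 0, 8)
)

# Naive ratio: a player with no attempts produces NaN (0 / 0).
naive_pct <- box$fgm / box$fga

# Guarded ratio: return NA explicitly when there were no attempts.
safe_pct <- function(made, attempts) {
  ifelse(attempts > 0, made / attempts, NA_real_)
}
box$fg_pct <- safe_pct(box$fgm, box$fga)

# A team-level percentage is computed from the totals rather than by
# averaging the per-player percentages, so missing rows cannot distort it.
team_fg_pct <- sum(box$fgm) / sum(box$fga)

print(box)
print(team_fg_pct)
```

Whether this would have resolved the specific inaccuracies the analyst encountered cannot be determined from the text; it only illustrates the general pattern.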
After these steps were done in Excel, the analyst then had to do some variable removal, as was done in the Roster data set. First, the analyst removed the variable team from the data set. This variable would be beneficial if the analysis were being conducted on more than one team, but with Indiana being the only team it adds nothing. Then the analyst removed the game_id column. It was created because the analyst originally did not know whether the analysis would need to link data tables. However, given the time constraints of the project the analyst decided not to do this, and thus the variable was removed. Similarly, as discussed in the rosters section, the variable player_id was deleted. Once again, if the athletic department wished to keep player information confidential, the code could be reconfigured to keep player_id and drop the player name. However, for this particular analysis this was not necessary, and thus the analyst removed the player_id column. These removals are demonstrated in the code below:
IU_BoxScores <-IU_BoxScores[,-1] #removal of the variable team
IU_BoxScores <- IU_BoxScores[,-2] #removal of game_id
IU_BoxScores <- IU_BoxScores[,-2] #removal of player_id
After data cleaning was accomplished, the analyst then loaded all of the libraries that would be needed to complete the analysis. The libraries and the respective reasons for loading them are listed below:
library(readxl) - used to read the data sets into R.
library(tidyverse) - used to manipulate the data; can be used to filter, join, and group the data.
library(ggplot2) - used to create bar graphs and other visualizations.
library(rpart) - used to generate the classification tree.
library(rpart.plot) - used to create the visual of the classification tree.
library(randomForest) - used to generate the random forest model.
library(dplyr) - used to filter and manipulate the data as needed.
library(cowplot) - used to arrange pictures in a grid for easy comparison of models.
library(ggplot2) - used to arrange pictures in a grid for easy comparison of models.
library(magick) - used to arrange pictures in a grid for easy comparison of models.
Then, the libraries were actually loaded into R using the code below:
#libraries needed for analysis
library(readxl)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.4 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(dplyr)
library(cowplot)
library(ggplot2)
library(magick)
## Linking to ImageMagick 6.9.12.3
## Enabled features: cairo, fontconfig, freetype, heic, lcms, pango, raw, rsvg, webp
## Disabled features: fftw, ghostscript, x11
First, data sets were created that hold the roster for each individual season. This is done by filtering the data as shown below:
#looking at the IU Roster by each season
#recall that 2020 is the 2019-2020 season, 2021 is the 2020-2021 season,
#2022 is the 2021-2022 season, and 2023 is the most recent 2022-2023 season.
IU_Roster2020 <- filter(IU_Rosters, season == 2020)
IU_Roster2021 <- filter(IU_Rosters, season == 2021)
IU_Roster2022 <- filter(IU_Rosters, season == 2022)
IU_Roster2023 <- filter(IU_Rosters, season == 2023)
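The four filters above could also be expressed as a single split. The sketch below uses base R's `split()` on a small made-up data frame standing in for IU_Rosters (the rows are hypothetical), and is offered as an alternative pattern rather than the approach the analyst used.

```r
# Toy roster (hypothetical rows) standing in for IU_Rosters.
rosters <- data.frame(
  season   = c(2020, 2020, 2021, 2022, 2023, 2023),
  position = c("G", "F", "G", "F", "C", "G"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per season.
by_season <- split(rosters, rosters$season)

# A single season is accessed by name (list names are character strings).
roster_2020 <- by_season[["2020"]]

# The same summary can then be applied to every season at once.
roster_sizes <- sapply(by_season, nrow)
print(roster_sizes)
```

Keeping the seasons in one list avoids repeating the same call once per season when a new season is added to the data.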
The analyst wanted to start by getting an idea of the number of players the Hoosiers carry on their roster each season, and set out to find whether the new head coach differed from the previous coach in the number of players on the roster per season.
count_players <- table(IU_Rosters$season)
barplot(count_players, main = "Size of Roster", xlab = "Season", ylim = c(0,20))
As shown in the figure above, the number of players on the team has stayed consistent over the last few years at 16 or 17. It is safe to assume that the roster will stay around the same size due to scholarships and NCAA regulations. In men’s college sports it is common for incoming freshmen to be redshirted in order to ease the transition from high school to college and/or to develop their athletic abilities. In some cases players may be redshirted to give them time to recover from a pre-season injury. Carrying a roster of this size thus helps ensure that the coach will have enough players to make it through the season.
After evaluating the number of players on the team, the analyst then wanted to gather information on the number of players within each position. By focusing on position, important recruiting information can be derived. Bar charts have been created below to display the comparison of number of players within each position.
#used to show all four graphs in one visual for easy comparison
par(mfrow = c(2,2))
#the bar graphs are made below for each season
count_class2020 <- table(IU_Roster2020$position)
barplot(count_class2020, main = "2020 Position Count", xlab = "Position", ylim = c(0,12))
count_class2021 <- table(IU_Roster2021$position)
barplot(count_class2021, main = "2021 Position Count", xlab = "Position", ylim = c(0,12))
count_class2022 <- table(IU_Roster2022$position)
barplot(count_class2022, main = "2022 Position Count", xlab = "Position", ylim = c(0,12))
count_class2023 <- table(IU_Roster2023$position)
barplot(count_class2023, main = "2023 Position Count", xlab = "Position", ylim = c(0,12))
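One caveat with tabulating each season separately is that a position with no players in a given season is simply absent from that season's table, so its bar is missing rather than drawn at zero. A two-way table over season and position makes such zeros explicit. The sketch below uses a small made-up roster, not the actual IU data.

```r
# Toy roster (hypothetical rows) standing in for IU_Rosters.
rosters <- data.frame(
  season   = c(2020, 2020, 2021, 2021, 2022, 2022, 2022),
  position = c("F", "G", "C", "G", "F", "F", "G"),
  stringsAsFactors = FALSE
)

# Cross-tabulate season by position: one row per season, one column per
# position, with explicit zeros where a position is not on the roster.
position_counts <- table(rosters$season, rosters$position)
print(position_counts)

# Share of each season's roster at each position (rows sum to 1).
position_shares <- prop.table(position_counts, margin = 1)
print(round(position_shares, 2))
```

A single table of this form could also be passed to `barplot(..., beside = TRUE)` to draw all seasons in one grouped chart.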
One immediate observation was the lack of centers on the team. In the 2019-2020 season, the Hoosiers did not have a single player on the roster listed at the center position. Even in the remaining seasons, the Hoosiers consistently had a very low number of players listed as centers. This could largely be because teams are looking for players who can be effective both in the paint and on the perimeter. Specifically between the 2021-2022 and 2022-2023 seasons, it appears that Woodson put an emphasis on recruiting more forwards to the team. In general, these charts show that regardless of coach the Hoosiers appear to have a larger number of opportunities available for guards and forwards.
To collect more general information on recruitment of the Indiana basketball team, the analyst then looked at where each player on the roster was from. The seasons in which Archie Miller was coach (2019-20 and 2020-21) were examined first, to see if there were any major differences once Mike Woodson took over (2021-22 and 2022-23). This was done in Tableau using the geographical feature. Screenshots for comparison have been included below:
Archie Miller’s recruiting years are displayed on the top, while Mike Woodson’s recruiting years are on the bottom. These diagrams show that over the past 4 seasons there have been players from all over the country. Both visuals also show a large number of recruits from the home state of Indiana. However, it does not appear that either coach was or is limiting recruitment to any particular region. In fact, it appears that Coach Mike Woodson has expanded his recruitment further than Archie Miller did, with player additions from Texas, Kansas, Missouri, and Virginia. Recruitment under Archie Miller appears to be more densely concentrated in Indiana and neighboring states. Mike Woodson still has many players from the same area, but the data points are clearly less dense in and around Indiana. In short, it appears that any Division I caliber athlete will have an opportunity to play in Bloomington regardless of where they are from. Under Mike Woodson specifically, there appears to be a larger effort to recruit top prospects even if they are across the country.
Next, the analyst gathered more specific information about the team. In basketball, height is a huge component of the game. There is and always has been a height advantage for teams in terms of shooting, blocking, and rebounding. With this in mind, the average height of the team per each coach was found.
Filtering the data for seasons Archie Miller was the head coach. Recall that this particular data set only has the past four seasons included, thus this filtering will only include roster information from the 2019-20 season as well as the 2020-21.
IU_RosterAM <- filter(IU_Rosters, season %in% c(2020, 2021)) #season is numeric
Filtering the data for seasons Mike Woodson has been the head coach. This includes the 2021-22 season as well as the 2022-23 season.
IU_RosterMW <- filter(IU_Rosters, season %in% c(2022, 2023)) #season is numeric
Looking at the average height by position for each coach. First start by grouping the data by position:
AM_by_position <- group_by(IU_RosterAM, position)
MW_by_position <- group_by(IU_RosterMW, position)
Now find the average height by position:
AM_avg_height_by_pos <- summarise(AM_by_position, type = mean(height, na.rm = TRUE))
AM_avg_height_by_pos
## # A tibble: 3 × 2
## position type
## <chr> <dbl>
## 1 C 83
## 2 F 79.8
## 3 G 74.8
MW_avg_height_by_pos <- summarise(MW_by_position, type = mean(height, na.rm = TRUE))
MW_avg_height_by_pos
## # A tibble: 3 × 2
## position type
## <chr> <dbl>
## 1 C 82.3
## 2 F 79.4
## 3 G 76
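The per-coach comparison can also be produced in one pass by labelling each row with its coach and tabulating mean height by position and coach. The following is a base R sketch on made-up heights (not the IU roster), with the `coach` column and the season-to-coach mapping taken from the text above.

```r
# Toy roster with hypothetical heights in inches.
rosters <- data.frame(
  season   = c(2020, 2021, 2021, 2022, 2023, 2023),
  position = c("G", "G", "F", "G", "F", "G"),
  height   = c(74, 76, 80, 75, 79, 77),
  stringsAsFactors = FALSE
)

# Label each row with the head coach for that season
# (2020 and 2021 under Miller, 2022 and 2023 under Woodson).
rosters$coach <- ifelse(rosters$season %in% c(2020, 2021), "Miller", "Woodson")

# Mean height as a position-by-coach matrix.
avg_height <- tapply(rosters$height,
                     list(rosters$position, rosters$coach),
                     mean, na.rm = TRUE)
print(avg_height)

# Change in average height per position, Woodson minus Miller.
height_change <- avg_height[, "Woodson"] - avg_height[, "Miller"]
print(height_change)
```

Having both coaches in one matrix makes the per-position difference a single subtraction, which reduces the chance of misreading two separate printouts.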
As seen above, in terms of height, there is nearly no difference in recruiting between the two coaches. The only position whose average changed by more than an inch was guard, where the average under Mike Woodson is just over an inch taller (74.8 to 76.0 inches); the averages for centers and forwards each changed by less than an inch. Understand that the data is limited, and with more years for each coach a larger difference could be seen. However, with the data available, the following observations can be made for the Hoosiers over the last 4 seasons:
Since recruitment has been similar over the past four seasons, a summary was generated to see both the shortest and the tallest players on the team. This allows the analyst to continue building a general recruiting profile that gives an idea of the Hoosiers’ preferred player size.
summary(IU_Rosters)
## season name position height
## Min. :2020 Length:67 Length:67 Min. :73.00
## 1st Qu.:2021 Class :character Class :character 1st Qu.:75.00
## Median :2022 Mode :character Mode :character Median :77.00
## Mean :2022 Mean :77.43
## 3rd Qu.:2022 3rd Qu.:79.00
## Max. :2023 Max. :84.00
## weight class hometown homestate
## Min. :185.0 Length:67 Length:67 Length:67
## 1st Qu.:189.0 Class :character Class :character Class :character
## Median :205.0 Mode :character Mode :character Mode :character
## Mean :209.3
## 3rd Qu.:226.5
## Max. :255.0
The generated summary shows that the smallest player on the roster is listed at 6’1”. It is clear that the chances of being recruited to Indiana University increase as a player’s height increases, because of the advantage it brings to the game. However, a smaller player could be recruited if they demonstrate the appropriate skill level.
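Roster heights are stored in inches, so reading the summary requires converting to feet and inches. A small helper for that conversion is sketched below; the function name `inches_to_feet` is hypothetical, and it assumes whole-inch inputs such as the minimum (73) and maximum (84) shown in the summary.

```r
# Convert whole inches to a feet-and-inches label, e.g. 73 -> 6'1".
inches_to_feet <- function(inches) {
  feet      <- inches %/% 12   # integer division gives whole feet
  remainder <- inches %% 12    # modulo gives leftover inches
  paste0(feet, "'", remainder, "\"")
}

# Minimum and maximum heights from the roster summary.
height_labels <- inches_to_feet(c(73, 84))
print(height_labels)
```

Applied to the summary values, this gives 6'1" for the shortest listed player and 7'0" for the tallest.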
After developing a good understanding of the roster similarities and differences for Coach Miller and Coach Woodson, the analyst looked into learning more about their overall success. For many coaches, success is denoted by the number of wins and the number of losses. With this in mind, the analyst looked at each coach’s overall record (in their respective 2 seasons in this data set). This was done by using the accumulated schedules of the team over the past 4 seasons and comparing the total number of wins directly to the number of losses. Ideally, a team should have more wins than losses to be considered a quality, competitive team. In addition, Mike Woodson and the AD hope to see an increase in the number of overall wins in his two seasons so far, which would indicate that the program is becoming more successful under his direction.
First, start by recreating the respective schedules for each coach.
IU_ScheduleAM <- filter(IU_Schedule, season %in% c(2020, 2021)) #season is numeric
IU_ScheduleAM
## # A tibble: 59 × 12
## season game_id date opponent location team_score opp_score
## <dbl> <dbl> <dttm> <chr> <chr> <dbl> <dbl>
## 1 2020 401166018 2019-11-05 00:00:00 Western I… H 98 65
## 2 2020 401166053 2019-11-09 00:00:00 Portland … H 85 74
## 3 2020 401166059 2019-11-12 00:00:00 North Ala… H 91 65
## 4 2020 401166073 2019-11-16 00:00:00 Troy H 100 62
## 5 2020 401166082 2019-11-20 00:00:00 Princeton H 79 54
## 6 2020 401166096 2019-11-25 00:00:00 Louisiana… H 88 75
## 7 2020 401166102 2019-11-30 00:00:00 South Dak… H 64 50
## 8 2020 401168234 2019-12-03 00:00:00 Florida S… H 80 64
## 9 2020 401166106 2019-12-07 00:00:00 Wisconsin A 64 84
## 10 2020 401169461 2019-12-10 00:00:00 UConn N 57 54
## # … with 49 more rows, and 5 more variables: score_dif <dbl>, outcome <chr>,
## # record <chr>, `conference record` <chr>, streak <chr>
IU_ScheduleMW <- filter(IU_Schedule, season %in% c(2022, 2023)) #season is numeric
Now, generate bar graphs to compare the two coaches’ overall records.
#used to show both graphs side by side for easy comparison
par(mfrow = c(1,2))
#compute the counts for the number of wins and the number of losses.
count_outcomesAM <- table(IU_ScheduleAM$outcome)
#compute the counts for the number of wins and the number of losses.
count_outcomesMW <- table(IU_ScheduleMW$outcome)
barplot(count_outcomesAM, main = "Coach Miller Overall Wins versus Losses", ylim = c(0,50))
barplot(count_outcomesMW, main = "Coach Woodson Overall Wins versus Losses", ylim = c(0,50))
While both coaches posted a winning percentage over 50% across their two seasons, it is clear in the visual that Mike Woodson has been more successful in his first two seasons than Archie Miller was in his last two seasons. To answer the question “How much more successful?” the analyst looked at the actual counts for both coaches.
#win/loss total for Archie Miller's past two seasons
count_outcomesAM
##
## L W
## 27 32
#win/loss total for Mike Woodson's first two seasons
count_outcomesMW
##
## L W
## 24 42
Archie Miller’s record was found to be 32-27, a winning percentage of 54.2%. Mike Woodson has improved the Hoosiers’ record to 42-24 over the last two seasons, a winning percentage of 63.6%. That is an increase of almost 10 percentage points (9.4) in winning percentage. This is a promising sign that with Mike Woodson as head coach, the Hoosiers are headed in the right direction.
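The arithmetic behind these percentages can be checked directly from the win and loss counts printed above. The sketch below also separates a change in percentage points from a relative change, since the two are easily confused; the helper name `win_pct` is hypothetical.

```r
# Winning percentage from wins and losses.
win_pct <- function(wins, losses) {
  100 * wins / (wins + losses)
}

# Counts taken from count_outcomesAM and count_outcomesMW above.
miller  <- win_pct(32, 27)
woodson <- win_pct(42, 24)

# Difference in percentage points (simple subtraction).
point_change <- woodson - miller

# Relative change, i.e. the difference as a share of Miller's percentage.
relative_change <- 100 * (woodson - miller) / miller

print(round(c(miller = miller, woodson = woodson,
              point_change = point_change,
              relative_change = relative_change), 1))
```

With these counts the winning percentage rises from 54.2% to 63.6%, which is 9.4 percentage points, or about 17.3% in relative terms.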
The analyst then wanted to look at how the Hoosiers did in each season of play. While the overall record shows that the 2021-22 and 2022-23 seasons were better overall than the 2019-20 and 2020-21 seasons, analyzing by season will allow the analyst to see whether the records are improving or declining with a particular coach leading the team. Start by filtering the schedule data by each given season.
IU_Schedule2020 <- filter(IU_Schedule, season == 2020)
IU_Schedule2021 <- filter(IU_Schedule, season == 2021)
IU_Schedule2022 <- filter(IU_Schedule, season == 2022)
IU_Schedule2023 <- filter(IU_Schedule, season == 2023)
Now, create the bar charts to compare the number of wins and losses within each season of this recruiting cycle.
#used to show all four graphs in one visual for easy comparison
par(mfrow = c(2,2))
#the bar graphs are made below for each season
count_outcome2020 <- table(IU_Schedule2020$outcome)
barplot(count_outcome2020, main = "2019-20 Outcome Totals", xlab = "Outcome", ylim = c(0,25))
count_outcome2021 <- table(IU_Schedule2021$outcome)
barplot(count_outcome2021, main = "2020-21 Outcome Totals", xlab = "Outcome", ylim = c(0,25))
count_outcome2022 <- table(IU_Schedule2022$outcome)
barplot(count_outcome2022, main = "2021-22 Outcome Totals", xlab = "Outcome", ylim = c(0,25))
count_outcome2023 <- table(IU_Schedule2023$outcome)
barplot(count_outcome2023, main = "2022-23 Outcome Totals", xlab = "Outcome", ylim = c(0,25))
In the above visual, the two seasons that Archie Miller was coach are displayed on top and the two seasons that Mike Woodson was coach are displayed on the bottom. Observations from the bar graphs include:
Similar to before, the analyst wanted to compare win percentages across the seasons:
count_outcome2020
##
## L W
## 12 20
count_outcome2021
##
## L W
## 15 12
count_outcome2022
##
## L W
## 14 21
count_outcome2023
##
## L W
## 10 21
This information was placed in a table for easy comparison and organization.
In this data set, Archie Miller’s first season (2019-20, 62.5%) was better than Mike Woodson’s first season (2021-22, 60.0%). However, the opposite was true of the following season: Miller’s winning percentage fell by 18.1 percentage points, while Woodson’s rose by 7.7 percentage points. Although Woodson’s first year as head coach was not as successful (in terms of wins) as Miller’s 2019-20 season, he took over after the Hoosiers’ worst season in the data (44.4%) and raised the winning percentage by 15.6 percentage points to 60.0%. He continued to push the program this past season with a 67.7% winning percentage, which beats every other season in the data by at least 5 percentage points. As such, there is evidence that Woodson has taken steps to move the basketball program in the right direction.
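The season-level percentages discussed here can be reproduced from the four printed outcome tables. The sketch below re-enters the counts by hand; the data frame name `season_records` and its column names are illustrative assumptions, not objects from the original analysis:

```r
# Season-level counts copied from the table() output above
season_records <- data.frame(
  season = c("2019-20", "2020-21", "2021-22", "2022-23"),
  coach  = c("Miller", "Miller", "Woodson", "Woodson"),
  wins   = c(20, 12, 21, 21),
  losses = c(12, 15, 14, 10)
)

# Winning percentage per season
season_records$win_pct <- round(
  100 * season_records$wins / (season_records$wins + season_records$losses), 1
)

# Change from the previous season, in percentage points (NA for the first row)
season_records$change <- c(NA, round(diff(season_records$win_pct), 1))

season_records
# win_pct: 62.5 44.4 60.0 67.7
# change:    NA -18.1 15.6 7.7
```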
After gathering general information on seasonal wins and losses for each coach, the analyst aimed to find trends in outcomes in relation to the location of the game. As basketball is one of the main contributors to athletics revenue, it is important to pack the stadium with fans in order to sell tickets. Also, since nearly half of the games are played at other locations, the analyst wanted to see whether both coaches could be competitive when playing at another school. It is important to remember that game outcome is the underlying factor in the team’s success, but other components of the game should be considered before drawing any outright conclusion about the coaches. It is understood that a team can play a “good” game and not have the outcome go its way. First, start by filtering the data accordingly:
#filtering the data by location for Archie Miller
IU_ScheduleAM_H <- filter(IU_ScheduleAM, location == "H")
IU_ScheduleAM_A <- filter(IU_ScheduleAM, location == "A")
IU_ScheduleAM_N <- filter(IU_ScheduleAM, location == "N")
#filtering the data by location for Mike Woodson
IU_ScheduleMW_H <- filter(IU_ScheduleMW, location == "H")
IU_ScheduleMW_A <- filter(IU_ScheduleMW, location == "A")
IU_ScheduleMW_N <- filter(IU_ScheduleMW, location == "N")
Now, display the differences in game outcome by location for each coach.
#used to show all six graphs in one visual for easy comparison
par(mfrow = c(3,2))
#the bar graphs for home games
count_outcomeAM_H <- table(IU_ScheduleAM_H$outcome)
barplot(count_outcomeAM_H, main = "Coach Miller Home Game Outcome Totals", xlab = "Outcome", ylim = c(0,30))
count_outcomeMW_H <- table(IU_ScheduleMW_H$outcome)
barplot(count_outcomeMW_H, main = "Coach Woodson Home Game Outcome Totals", xlab = "Outcome", ylim = c(0,30))
#the bar graphs for away games
count_outcomeAM_A <- table(IU_ScheduleAM_A$outcome)
barplot(count_outcomeAM_A, main = "Coach Miller Away Game Outcome Totals", xlab = "Outcome", ylim = c(0,30))
count_outcomeMW_A <- table(IU_ScheduleMW_A$outcome)
barplot(count_outcomeMW_A, main = "Coach Woodson Away Game Outcome Totals", xlab = "Outcome", ylim = c(0,30))
#the bar graphs for neutral games
count_outcomeAM_N <- table(IU_ScheduleAM_N$outcome)
barplot(count_outcomeAM_N, main = "Coach Miller Neutral Game Outcome Totals", xlab = "Outcome", ylim = c(0,30))
count_outcomeMW_N <- table(IU_ScheduleMW_N$outcome)
barplot(count_outcomeMW_N, main = "Coach Woodson Neutral Game Outcome Totals", xlab = "Outcome", ylim = c(0,30))
From the visual the following observations can be made:
After drawing conclusions about location overall for each coach, the analyst started to look at the coaches’ individual seasons to compare them against each other. Note that neutral site games have been removed, since there isn’t enough data on hand. First, look at the differences over the two years for Archie Miller. Start by filtering the data:
#2019-2020 Season
IU_Schedule2020_H <- filter(IU_Schedule2020, location == "H")
IU_Schedule2020_A <- filter(IU_Schedule2020, location == "A")
#2020-2021 Season
IU_Schedule2021_H <- filter(IU_Schedule2021, location == "H")
IU_Schedule2021_A <- filter(IU_Schedule2021, location == "A")
Now, create the bar graphs to compare each outcome location by season:
#used to show all four graphs in one visual for easy comparison
par(mfrow = c(2,2))
#bar graphs for home games
count_outcome2020_H <- table(IU_Schedule2020_H$outcome)
barplot(count_outcome2020_H, main = "2019-20 Home Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))
count_outcome2021_H <- table(IU_Schedule2021_H$outcome)
barplot(count_outcome2021_H, main = "2020-21 Home Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))
#bar graphs for away games
count_outcome2020_A <- table(IU_Schedule2020_A$outcome)
barplot(count_outcome2020_A, main = "2019-20 Away Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))
count_outcome2021_A <- table(IU_Schedule2021_A$outcome)
barplot(count_outcome2021_A, main = "2020-21 Away Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))
Observations from the bar graphs include:
In general, the 2020-21 season under Archie Miller appeared to be a struggle for the Hoosiers regardless of location.
Next, the analyst looked to do the same to evaluate Mike Woodson. Start by filtering the data as necessary:
#2021-2022 Season
IU_Schedule2022_H <- filter(IU_Schedule2022, location == "H")
IU_Schedule2022_A <- filter(IU_Schedule2022, location == "A")
#2022-2023 Season
IU_Schedule2023_H <- filter(IU_Schedule2023, location == "H")
IU_Schedule2023_A <- filter(IU_Schedule2023, location == "A")
Now, generate the bar graphs for easy visual comparison:
#used to show all four graphs in one visual for easy comparison
par(mfrow = c(2,2))
#bar graphs for home games
count_outcome2022_H <- table(IU_Schedule2022_H$outcome)
barplot(count_outcome2022_H, main = "2021-22 Home Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))
count_outcome2023_H <- table(IU_Schedule2023_H$outcome)
barplot(count_outcome2023_H, main = "2022-23 Home Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))
#bar graphs for away games
count_outcome2022_A <- table(IU_Schedule2022_A$outcome)
barplot(count_outcome2022_A, main = "2021-22 Away Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))
count_outcome2023_A <- table(IU_Schedule2023_A$outcome)
barplot(count_outcome2023_A, main = "2022-23 Away Game Outcome Totals", xlab = "Outcome", ylim = c(0,20))
Observations from the bar graphs include:
In general, the Hoosiers have seen improvement under the guidance of Mike Woodson regardless of game location. They are being effective and finding more ways to win on the road.
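The separate home, away, and neutral filters used in this section can also be condensed into a single cross-tabulation. The following is only a sketch on made-up rows: `toy_schedule` is a hypothetical stand-in for the schedule data and assumes just the `location` and `outcome` columns seen earlier.

```r
# Hypothetical stand-in for the schedule data; values are invented for illustration
toy_schedule <- data.frame(
  location = c("H", "H", "H", "A", "A", "A", "N", "H", "A", "N"),
  outcome  = c("W", "W", "L", "L", "W", "L", "W", "W", "L", "L")
)

# Counts of each outcome within each location
loc_by_outcome <- table(toy_schedule$location, toy_schedule$outcome)
loc_by_outcome

# Row-wise proportions: share of losses and wins within each location
round(prop.table(loc_by_outcome, margin = 1), 2)
```

Working with proportions rather than raw counts makes locations with different numbers of games easier to compare.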
After looking at general location information between the coaches, the analyst looked to compare mean scoring information over the course of the two seasons for each coach. First, start by getting the overall averages for Archie Miller.
mean(IU_ScheduleAM$team_score)
## [1] 70.64407
mean(IU_ScheduleAM$opp_score)
## [1] 67.9322
mean(IU_ScheduleAM$score_dif)
## [1] 2.711864
Now, find the averages for each season in which Archie Miller was coach during the four seasons analyzed.
aggregate(cbind(team_score, opp_score, score_dif) ~ season, data = IU_ScheduleAM, FUN = mean, na.rm = TRUE)
## season team_score opp_score score_dif
## 1 2020 71.4375 66.71875 4.7187500
## 2 2021 69.7037 69.37037 0.3333333
Now, the same will be done for Mike Woodson starting with his overall averages.
mean(IU_ScheduleMW$team_score)
## [1] 72.89394
mean(IU_ScheduleMW$opp_score)
## [1] 67.24242
mean(IU_ScheduleMW$score_dif)
## [1] 5.651515
Then, find the averages for each season in which Mike Woodson was coach during the four seasons analyzed.
aggregate(cbind(team_score, opp_score, score_dif) ~ season, data = IU_ScheduleMW, FUN = mean, na.rm = TRUE)
## season team_score opp_score score_dif
## 1 2022 70.80000 66.17143 4.628571
## 2 2023 75.25806 68.45161 6.806452
The data will now be presented in a table for easier reading and comparison.
Observations from this table include:
Once again, Mike Woodson appears to be leading the program in the right direction.
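For reference, the season averages printed by `aggregate()` can be gathered into one comparison table. This sketch re-enters the printed means by hand; `scoring` is an illustrative name, and the recomputed differential assumes `score_dif` is `team_score` minus `opp_score`, which the printed values are consistent with.

```r
# Season means copied from the aggregate() output above
scoring <- data.frame(
  season     = c(2020, 2021, 2022, 2023),
  coach      = c("Miller", "Miller", "Woodson", "Woodson"),
  team_score = c(71.4375, 69.7037, 70.80000, 75.25806),
  opp_score  = c(66.71875, 69.37037, 66.17143, 68.45161)
)

# Assumption: score differential = points scored minus points allowed
scoring$score_dif <- scoring$team_score - scoring$opp_score

round(scoring$score_dif, 2)  # 4.72 0.33 4.63 6.81
scoring
```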
After looking at the scoring averages, the analyst decided to look a little closer at score differential. While points scored is important, it is not necessarily meaningful unless the other team’s score is taken into account. For example, if the Hoosiers score a lofty 100 points in a game, it would not mean much if the other team scored 101. So, the analyst looked to compare score differential trends for each coach over the course of the season. First, this was done by constructing a bar graph that averaged each coach’s two years by month (November through March).
p1 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/Miller Score Dif Averages.png", scale = 1)
p2 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/Woodson Score Dif Averages.png", scale = 1)
plot_grid(p1, p2, ncol = 2)
The following observations can be seen from the side by side comparisons of score differential by month:
After looking at general trends, the analyst then looked to break the score differential down by each individual season to see if there were any identifiable trends. A chart displaying the time series information is shown below:
p1 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/2019-20 Season Dif.png", scale = 1)
p2 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/2020-21 Season Dif.png", scale = 1)
p3 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/2021-22 Season Dif.png", scale = 1)
p4 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/2022-23 Season Dif.png", scale = 1)
plot_grid(p1, p2, p3, p4, ncol = 2)
The time series for each season all show a similar descending trend line in score differential. This appears to be an ongoing problem for the team regardless of who is coaching. The analyst looked to see if the intercept value occurred later in the year for each of the seasons:
This appears to show that there really was no difference in the trend lines other than in the 2020-21 season. Thus, when comparing the two coaches, this does not appear to offer any viable differences in regards to score differential as the season progresses.
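The trend lines themselves were imported as images, so the original does not show how they were fitted. As a hedged sketch, assuming the “intercept value” refers to the point where a fitted linear trend crosses a score differential of zero, the calculation could look like this on simulated data. The data frame `toy_season` and its columns `game` and `score_dif` are hypothetical.

```r
# Simulated season: 30 games with a score differential that drifts downward
set.seed(7)
toy_season <- data.frame(game = 1:30)
toy_season$score_dif <- 12 - 0.6 * toy_season$game + rnorm(30, sd = 4)

# Fit a straight-line trend of score differential against game number
trend <- lm(score_dif ~ game, data = toy_season)
b <- coef(trend)

# Game number at which the fitted line reaches zero: 0 = intercept + slope * game
zero_crossing <- -b[["(Intercept)"]] / b[["game"]]
zero_crossing
```

A later zero crossing would mean the team kept outscoring opponents, on average, deeper into the season.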
After looking at score differential information, the analyst then wanted to compare opponents played between each of the coaches. The analysis is more comparable if there is a control variable. Thus, if the analyst finds common opponents then the data between the two will be more comparable.
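Common opponents can also be identified programmatically rather than read off a chart. A minimal sketch, using invented opponent vectors in place of `IU_ScheduleAM$opponent` and `IU_ScheduleMW$opponent`; the non-conference names are placeholders, not real schedule entries:

```r
# Hypothetical opponent lists; "Nonconf A" etc. are placeholders
opponents_am <- c("Purdue", "Iowa", "Nonconf A", "Purdue", "Maryland", "Nonconf B")
opponents_mw <- c("Purdue", "Iowa", "Nonconf C", "Maryland", "Iowa", "Nonconf D")

# Opponents that appear at least once under both coaches
common_opponents <- sort(intersect(unique(opponents_am), unique(opponents_mw)))
common_opponents  # "Iowa" "Maryland" "Purdue"
```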
#used to show both graphs in one visual for easy comparison
par(mfrow = c(1,2))
count_opponentsAM <- table(IU_ScheduleAM$opponent)
barplot(count_opponentsAM, las = 2, cex.names = 0.3, ylab = "Number of Times Played", main = "Archie Miller Opponents")
count_opponentsMW <- table(IU_ScheduleMW$opponent)
barplot(count_opponentsMW, las = 2, cex.names = 0.3, ylab = "Number of Times Played", main = "Mike Woodson Opponents")
The visuals above show that the opponents common to both coaches are conference opponents. Thus, it may be necessary to filter the data to include only games against teams within the Big Ten Conference. These teams include:
With having a control group in mind, the analyst then wanted to look at each in-conference team played and derive specific information per coach. First, filter the data to only include information on the Big Ten teams for each coach.
#Big Ten opponents used as the common (control) group for both coaches
big_ten_opponents <- c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State", "Minnesota", "Nebraska", "Northwestern", "Ohio State", "Penn State", "Purdue", "Rutgers", "Wisconsin")
#Archie Miller data by in-conference teams
IU_ScheduleAM_conf <- filter(IU_ScheduleAM, opponent %in% big_ten_opponents)
#Mike Woodson data by in-conference teams
IU_ScheduleMW_conf <- filter(IU_ScheduleMW, opponent %in% big_ten_opponents)
After filtering the data, find the overall conference records for each coach in the two seasons of data available:
#Archie Miller conference record
AMconf_outcome_count <- count(IU_ScheduleAM_conf, outcome)
AMconf_outcome_count
## # A tibble: 2 × 2
## outcome n
## <chr> <int>
## 1 L 24
## 2 W 17
#Mike Woodson conference record
MWconf_outcome_count <- count(IU_ScheduleMW_conf, outcome)
MWconf_outcome_count
## # A tibble: 2 × 2
## outcome n
## <chr> <int>
## 1 L 20
## 2 W 23
Based on the counts above, Archie Miller had an overall record of 17-24 against Big Ten opponents, an in-conference win percentage of 41.5%. In contrast, Mike Woodson posted a record of 23-20, a 53.5% win percentage. The first two years of coaching for Mike Woodson were more successful in terms of outcome than Archie Miller’s last two years.
The analyst then wanted to derive specific information on the coach’s record against each opponent. This was done using Tableau. The generated bar graphs are included below:
Observations from the visual include:
Observations from the visual include:
After observing and analyzing the generated tables, it was time to compare the findings between the two coaches:
Miller had a losing record against 8 of the 13 conference teams, including 5 teams that remained unbeaten against him in his last two seasons as head coach. In comparison, Woodson posted a losing record against 6 of the 13 teams, including just two teams that remain unbeaten. For Miller these teams were Illinois, Michigan, Purdue, Rutgers, and Wisconsin. For Woodson they were Iowa and Northwestern. It is interesting that the teams that went unbeaten in each two-year span were different for the two coaches.
Conversely, Miller had a winning record against 5 of the 13 conference teams, including 3 teams that he was able to beat every time. Coach Woodson, on the other hand, had a winning record against 7 of the conference teams, although he beat just 2 teams every time. Miller’s perfect-record teams were Iowa, Minnesota, and Nebraska, while Woodson’s perfect-record teams were Minnesota and Nebraska.
It appears that Woodson’s two years were more successful in conference play than Miller’s last two years. Although one team went from being beaten every time under Miller to beating the Hoosiers every time under Woodson, the overall number of winning records for Woodson was well above that of Miller. Once again, this points to improvement within the program.
Similar to before, the analyst then looked to see how each coach performed over the course of the season within conference play. While all conference play is important, the end of the season is where conference titles and automatic tournament berths are granted. Tableau was used to create the following time series visuals of average score differential by month for in-conference games.
It is clear from the visuals that Coach Miller’s teams struggled in conference play no matter which part of the season the games occurred in. In particular, the beginning of conference play (in December) appeared to be the worst for Archie Miller. Although the team generally saw improvement over the course of the season, it could never seem to outscore its opponents, always posting a negative score differential (the other team was scoring more points than the Hoosiers). This was then compared to Mike Woodson’s teams:
In contrast, Mike Woodson’s teams have had relative success throughout the season in in-conference games. In fact, in three out of the four months of conference play the Hoosiers posted a positive point differential against their opponents, which was much different from the outcomes they had under Miller. However, there still appears to be a period of struggle under Woodson, marked by February. Although this visual demonstrates that there are still shortcomings within the program, there is clear evidence that the Hoosiers are improving their play under the current head coach.
This was then taken one step further to compare score differential by date in each of the seasons.
p1 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/Score Dif Conf 2019-20.png", scale = 1)
p2 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/Score Dif Conf 2020-21.png", scale = 1)
p3 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/Score Dif Conf 2021-22.png", scale = 1)
p4 <- ggdraw() + draw_image("/Users/kamriefoster/Desktop/Coach Wood Analysis/Score Dif Conf 2022-23.png", scale = 1)
plot_grid(p1, p2, p3, p4, ncol = 2)
The following observations can be made:
Overall, the two graphs on the bottom display a better trend in score differential than the top two, displaying that in the past two years the Hoosiers have been improving under Mike Woodson.
Now look at general scoring trends per opponent for each coach, starting with Archie Miller.
aggregate(cbind(team_score, opp_score, score_dif) ~ opponent, data = IU_ScheduleAM_conf, FUN = mean, na.rm = TRUE)
## opponent team_score opp_score score_dif
## 1 Illinois 65.66667 70.33333 -4.666667
## 2 Iowa 79.00000 70.33333 8.666667
## 3 Maryland 66.00000 69.00000 -3.000000
## 4 Michigan 61.00000 81.00000 -20.000000
## 5 Michigan State 65.33333 68.33333 -3.000000
## 6 Minnesota 74.00000 65.00000 9.000000
## 7 Nebraska 87.75000 76.00000 11.750000
## 8 Northwestern 70.66667 70.66667 0.000000
## 9 Ohio State 61.33333 66.66667 -5.333333
## 10 Penn State 68.00000 69.66667 -1.666667
## 11 Purdue 59.50000 69.75000 -10.250000
## 12 Rutgers 58.25000 67.00000 -8.750000
## 13 Wisconsin 64.33333 74.66667 -10.333333
aggregate(cbind(team_score, opp_score, score_dif) ~ opponent, data = IU_ScheduleMW_conf, FUN = mean, na.rm = TRUE)
## opponent team_score opp_score score_dif
## 1 Illinois 68.25000 67.50000 0.750000
## 2 Iowa 77.00000 86.00000 -9.000000
## 3 Maryland 65.66667 61.66667 4.000000
## 4 Michigan 68.25000 70.75000 -2.500000
## 5 Michigan State 69.33333 75.00000 -5.666667
## 6 Minnesota 72.66667 65.33333 7.333333
## 7 Nebraska 75.66667 63.66667 12.000000
## 8 Northwestern 65.33333 69.00000 -3.666667
## 9 Ohio State 74.00000 67.00000 7.000000
## 10 Penn State 66.00000 67.66667 -1.666667
## 11 Purdue 73.25000 69.75000 3.500000
## 12 Rutgers 59.00000 63.00000 -4.000000
## 13 Wisconsin 63.66667 61.00000 2.666667
Observations between the two data tables include:
It is clearly demonstrated that under Woodson’s guidance the Hoosiers competed better against their in-conference opponents in terms of overall wins and losses as well as score differential. Overall, this is another good mark for the Hoosiers and their decision to hire Woodson. Other components of the game should still be evaluated.
Before any descriptive analysis was done on this table, it was necessary to find the variables that had the most influence on game outcome over the past four seasons. Since a box score has so many variables in it, it is necessary to limit the scope of the research being conducted. Otherwise, solid and well-informed conclusions may be hard to reach. With this being said, several models were developed to ensure consistency in variable identification.
Since the analyst is interested in team trends and not individual players, the box score was filtered to only include the team’s box score lines for each game.
IU_TeamBS <- filter(IU_BoxScores, player == "TEAM")
After filtering to only include team information, the data was then filtered to only include conference opponent games. Since there was consistency in these opponents across both coaches, this acts as a control group.
IU_BS_conf <- filter(IU_TeamBS, opponent %in% c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State", "Minnesota", "Nebraska", "Northwestern", "Ohio State", "Penn State", "Purdue", "Rutgers", "Wisconsin"))
Now, some columns of the data will be deleted, as they contain recurring information that adds no value to the analysis.
IU_BS_conf <- IU_BS_conf[,-22]
IU_BS_conf <- IU_BS_conf[,-21]
#IU_BS_conf <- IU_BS_conf[,-20]
#IU_BS_conf <- IU_BS_conf[,-19]
IU_BS_conf <- IU_BS_conf[,-13]
IU_BS_conf <- IU_BS_conf[,-4]
IU_BS_conf <- IU_BS_conf[,-3]
IU_BS_conf <- IU_BS_conf[,-2]
IU_BS_conf <- IU_BS_conf[,-1]
IU_BS_conf
## # A tibble: 83 × 18
## FGM FGA ThreePTM ThreePTA FTM FTA OREB DREB AST STL BLK TO
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 50 5 14 17 26 10 18 12 3 3 12
## 2 32 68 5 25 27 38 19 35 14 5 6 15
## 3 22 61 4 18 11 18 15 27 7 3 6 14
## 4 20 54 3 14 23 30 15 25 11 9 4 16
## 5 20 49 6 12 20 36 8 27 10 11 4 11
## 6 19 60 2 19 10 12 11 26 6 4 2 16
## 7 31 61 8 26 12 20 12 36 21 2 8 16
## 8 26 57 4 12 11 20 10 21 12 6 2 8
## 9 30 57 9 19 7 10 9 24 22 0 2 6
## 10 19 57 2 11 9 10 11 33 9 3 3 17
## # … with 73 more rows, and 6 more variables: PF <dbl>, PTS <dbl>, POSS <dbl>,
## # opponent <chr>, location <chr>, outcome <dbl>
Next, generate shooting percentages within the data set.
IU_BS_conf$FGP <- (IU_BS_conf$FGM / IU_BS_conf$FGA) * 100
IU_BS_conf$ThreePTP <- (IU_BS_conf$ThreePTM / IU_BS_conf$ThreePTA) * 100
IU_BS_conf$FTP <- (IU_BS_conf$FTM / IU_BS_conf$FTA) * 100
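As a worked check of these formulas, the sketch below applies them to the first row of the tibble printed above (FGM 21, FGA 50, ThreePTM 5, ThreePTA 14, FTM 17, FTA 26). The object name `first_game` is introduced only for illustration.

```r
# First row of the box score tibble printed above
first_game <- data.frame(FGM = 21, FGA = 50, ThreePTM = 5, ThreePTA = 14, FTM = 17, FTA = 26)

# Same formulas as in the analysis: makes divided by attempts, times 100
first_game$FGP      <- (first_game$FGM / first_game$FGA) * 100
first_game$ThreePTP <- (first_game$ThreePTM / first_game$ThreePTA) * 100
first_game$FTP      <- (first_game$FTM / first_game$FTA) * 100

round(c(FGP = first_game$FGP, ThreePTP = first_game$ThreePTP, FTP = first_game$FTP), 1)
# FGP 42.0, ThreePTP 35.7, FTP 65.4
```

A game with zero attempts in a category would produce `NaN` for that percentage, so such rows would need separate handling.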
Now, delete the columns that were used to make the new columns so that collinearity does not become a factor for the models:
IU_BS_conf <- IU_BS_conf[,-(1:6)]
The last step is to create a training and testing set to generate and test the models. Since the model is less focused on predictive ability and more focused on variable importance, the data will be split into 90% training and 10% testing. Model accuracy will be found to provide the analyst with which model is the best to help rank the variable importance. The code to produce the training and testing sets is displayed below:
set.seed(1234) #used to provide consistency across the training and testing sets in successive runs
index <- sample(nrow(IU_BS_conf), nrow(IU_BS_conf)*0.90)
IU_BS_conf_train = IU_BS_conf[index,]
IU_BS_conf_test = IU_BS_conf[-index,]
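To make the resulting set sizes explicit: 90% of 83 rows is 74.7, which is not a whole number. The model summaries later report 73 null degrees of freedom, consistent with 74 training rows, so the fractional size appears to be truncated. The sketch below reproduces the split on a placeholder data frame (`toy`, hypothetical) with the rounding written out via `floor()`.

```r
set.seed(1234)

# Placeholder data with the same number of rows as the tibble printed above
toy <- data.frame(id = 1:83)

# Make the training size an explicit whole number
n_train <- floor(0.90 * nrow(toy))

index <- sample(nrow(toy), n_train)
train <- toy[index, , drop = FALSE]
test  <- toy[-index, , drop = FALSE]

c(train = nrow(train), test = nrow(test))  # train 74, test 9
```

With only 9 held-out games, any accuracy figure computed on the test set will be coarse, which fits the stated focus on variable importance rather than prediction.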
To find important variables in the box score data set in terms of game outcome, the analyst first created a logistic regression model. This is one of the simplest predictive models, but still can be one of the most effective. The steps to create a statistically significant logistic regression model are included below:
## Train a logistic regression model with all variables
glm0 <- glm(outcome~ ., family = binomial, data = IU_BS_conf_train)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#family = binomial is used because the outcome is a binary variable (win or loss).
summary(glm0)
##
## Call:
## glm(formula = outcome ~ ., family = binomial, data = IU_BS_conf_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.364e-05 -2.110e-08 -2.110e-08 2.110e-08 2.205e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.327e+02 1.612e+06 0.000 1.000
## OREB -4.519e+00 2.390e+04 0.000 1.000
## DREB 1.883e+01 1.388e+04 0.001 0.999
## AST -1.236e+01 2.260e+04 -0.001 1.000
## STL 2.206e+01 2.257e+04 0.001 0.999
## BLK 3.374e+00 2.286e+04 0.000 1.000
## TO 1.018e+01 2.905e+04 0.000 1.000
## PF -3.189e+00 1.570e+04 0.000 1.000
## PTS 9.380e+00 2.761e+04 0.000 1.000
## POSS -1.717e+01 2.867e+04 -0.001 1.000
## opponentIowa 1.074e+01 1.976e+05 0.000 1.000
## opponentMaryland 2.303e+01 1.951e+05 0.000 1.000
## opponentMichigan 1.225e+02 1.704e+05 0.001 0.999
## opponentMichigan State 4.131e+01 1.721e+05 0.000 1.000
## opponentMinnesota -4.345e+00 2.256e+05 0.000 1.000
## opponentNebraska -3.472e+01 3.589e+05 0.000 1.000
## opponentNorthwestern -3.852e+01 2.233e+05 0.000 1.000
## opponentOhio State -4.011e+00 1.761e+05 0.000 1.000
## opponentPenn State 6.474e+01 2.933e+05 0.000 1.000
## opponentPurdue 1.026e+02 2.386e+05 0.000 1.000
## opponentRutgers -7.173e+00 1.490e+05 0.000 1.000
## opponentWisconsin -5.507e+01 1.340e+05 0.000 1.000
## locationH 7.684e+00 1.040e+05 0.000 1.000
## locationN 5.747e+01 1.306e+05 0.000 1.000
## FGP -4.005e-01 2.112e+04 0.000 1.000
## ThreePTP 2.855e-01 5.709e+03 0.000 1.000
## FTP -3.134e+00 7.109e+03 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.0253e+02 on 73 degrees of freedom
## Residual deviance: 3.7386e-09 on 47 degrees of freedom
## AIC: 54
##
## Number of Fisher Scoring iterations: 25
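Start by removing opponent from the model, since it appears to be contributing to the perfect-separation problem flagged by the warnings above (fitted probabilities of 0 or 1 and failure to converge). PTS and POSS are also left out of the formula below. Note that the call below does not pass family = binomial, so glm() uses its default Gaussian family; the summaries that follow (with t values and a “gaussian family” dispersion line) are therefore linear probability fits rather than true logistic regressions.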
glm1 <- glm(outcome ~ OREB +DREB + AST +STL +BLK +TO + PF + location + FGP + ThreePTP + FTP, data = IU_BS_conf_train)
summary(glm1)
##
## Call:
## glm(formula = outcome ~ OREB + DREB + AST + STL + BLK + TO +
## PF + location + FGP + ThreePTP + FTP, data = IU_BS_conf_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.78686 -0.24335 0.02487 0.23441 0.79378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.6993854 0.5492345 -4.915 7.02e-06 ***
## OREB 0.0189809 0.0152656 1.243 0.218488
## DREB 0.0510768 0.0100081 5.104 3.51e-06 ***
## AST -0.0343935 0.0152200 -2.260 0.027419 *
## STL 0.0692396 0.0188813 3.667 0.000517 ***
## BLK 0.0294810 0.0220482 1.337 0.186149
## TO -0.0250526 0.0137841 -1.817 0.074053 .
## PF -0.0015966 0.0134431 -0.119 0.905850
## locationH 0.1009491 0.1035476 0.975 0.333458
## locationN 0.0991390 0.2069266 0.479 0.633579
## FGP 0.0427634 0.0087269 4.900 7.40e-06 ***
## ThreePTP 0.0030286 0.0037434 0.809 0.421626
## FTP -0.0004662 0.0041730 -0.112 0.911418
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1314617)
##
## Null deviance: 18.4865 on 73 degrees of freedom
## Residual deviance: 8.0192 on 61 degrees of freedom
## AIC: 73.558
##
## Number of Fisher Scoring iterations: 2
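The “gaussian family” line in the summary above is what `glm()` reports when no `family` argument is supplied, since Gaussian is its default. The sketch below illustrates the difference on simulated data; `toy`, `x`, and `won` are hypothetical names, and this is not a re-run of the original models.

```r
# Simulated binary outcome driven by one numeric predictor
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$won <- rbinom(100, 1, plogis(toy$x))

# No family argument: glm() fits an ordinary linear model to the 0/1 outcome
fit_default <- glm(won ~ x, data = toy)

# family = binomial: a logistic regression
fit_logistic <- glm(won ~ x, family = binomial, data = toy)

family(fit_default)$family   # "gaussian"
family(fit_logistic)$family  # "binomial"

# Logistic fitted values are probabilities strictly between 0 and 1;
# the Gaussian fit carries no such guarantee
range(fitted(fit_logistic))
range(fitted(fit_default))
```

Adding `family = binomial` to the calls in this section would change the coefficients, test statistics, and p-values, so the variable rankings would need to be rechecked.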
Now, the variable with the highest p-value is removed one at a time, and the model is refit after each removal, until every remaining variable falls under the 0.05 significance level. Start by removing FTP (p-value of 0.911418).
glm2 <- glm(outcome ~ OREB +DREB + AST +STL +BLK +TO + PF + location + FGP + ThreePTP, data = IU_BS_conf_train)
summary(glm2)
##
## Call:
## glm(formula = outcome ~ OREB + DREB + AST + STL + BLK + TO +
## PF + location + FGP + ThreePTP, data = IU_BS_conf_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.78531 -0.23875 0.02996 0.23290 0.79034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.728116 0.481418 -5.667 4.04e-07 ***
## OREB 0.019392 0.014697 1.319 0.191873
## DREB 0.050960 0.009874 5.161 2.74e-06 ***
## AST -0.034404 0.015098 -2.279 0.026142 *
## STL 0.069152 0.018714 3.695 0.000467 ***
## BLK 0.029543 0.021865 1.351 0.181559
## TO -0.025298 0.013499 -1.874 0.065644 .
## PF -0.001735 0.013279 -0.131 0.896478
## locationH 0.098588 0.100557 0.980 0.330690
## locationN 0.098361 0.205156 0.479 0.633309
## FGP 0.042805 0.008649 4.949 6.03e-06 ***
## ThreePTP 0.003024 0.003713 0.815 0.418475
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1293679)
##
## Null deviance: 18.4865 on 73 degrees of freedom
## Residual deviance: 8.0208 on 62 degrees of freedom
## AIC: 71.573
##
## Number of Fisher Scoring iterations: 2
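The remove-and-refit cycle used in this section can also be automated. The following is a sketch on simulated data, not the original analysis: the function `backward_eliminate` and the data frame `toy` are hypothetical, it uses `family = binomial`, and it assumes numeric predictors only so that coefficient names match predictor names.

```r
# Simulated data: y depends on x1 and x2; x3 and x4 are noise
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1 - 1.2 * toy$x2))

# Hypothetical helper: drop the predictor with the largest p-value and refit,
# until all remaining p-values are at or below alpha (or one predictor is left)
backward_eliminate <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    fit <- glm(reformulate(predictors, response = response),
               family = binomial, data = data)
    coefs <- summary(fit)$coefficients
    pvals <- coefs[-1, "Pr(>|z|)"]        # skip the intercept row
    names(pvals) <- rownames(coefs)[-1]
    if (max(pvals) <= alpha || length(predictors) == 1) return(fit)
    predictors <- setdiff(predictors, names(pvals)[which.max(pvals)])
  }
}

final_fit <- backward_eliminate(toy, "y", c("x1", "x2", "x3", "x4"))
names(coef(final_fit))
```

Stepwise selection by p-value is sensitive to small samples and correlated predictors, so the surviving variables are better read as a rough ranking than as a definitive list.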
Now, remove PF from the model (p-value of 0.896478):
glm3 <- glm(outcome ~ OREB +DREB + AST +STL +BLK + TO + location + FGP + ThreePTP, data = IU_BS_conf_train)
summary(glm3)
##
## Call:
## glm(formula = outcome ~ OREB + DREB + AST + STL + BLK + TO +
## location + FGP + ThreePTP, data = IU_BS_conf_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.79211 -0.23701 0.02454 0.23082 0.79408
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.754167 0.434750 -6.335 2.84e-08 ***
## OREB 0.019144 0.014460 1.324 0.190307
## DREB 0.051163 0.009674 5.289 1.65e-06 ***
## AST -0.034311 0.014963 -2.293 0.025196 *
## STL 0.069125 0.018567 3.723 0.000422 ***
## BLK 0.029264 0.021590 1.355 0.180121
## TO -0.025485 0.013318 -1.914 0.060227 .
## locationH 0.100935 0.098165 1.028 0.307778
## locationN 0.099364 0.203406 0.489 0.626890
## FGP 0.042669 0.008519 5.009 4.71e-06 ***
## ThreePTP 0.003046 0.003681 0.827 0.411094
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1273494)
##
## Null deviance: 18.486 on 73 degrees of freedom
## Residual deviance: 8.023 on 63 degrees of freedom
## AIC: 69.593
##
## Number of Fisher Scoring iterations: 2
Then, remove the variable location (p-value of 0.626890).
glm4 <- glm(outcome ~ OREB +DREB + AST +STL +BLK + TO + FGP + ThreePTP, data = IU_BS_conf_train)
summary(glm4)
##
## Call:
## glm(formula = outcome ~ OREB + DREB + AST + STL + BLK + TO +
## FGP + ThreePTP, data = IU_BS_conf_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7299 -0.2329 0.0369 0.2156 0.7434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.839364 0.419927 -6.762 4.57e-09 ***
## OREB 0.021678 0.014016 1.547 0.1268
## DREB 0.051360 0.009564 5.370 1.13e-06 ***
## AST -0.035358 0.013767 -2.568 0.0125 *
## STL 0.076635 0.017065 4.491 2.98e-05 ***
## BLK 0.029000 0.021440 1.353 0.1809
## TO -0.027730 0.013026 -2.129 0.0371 *
## FGP 0.044986 0.008059 5.582 4.99e-07 ***
## ThreePTP 0.003198 0.003600 0.888 0.3777
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1256607)
##
## Null deviance: 18.4865 on 73 degrees of freedom
## Residual deviance: 8.1679 on 65 degrees of freedom
## AIC: 66.918
##
## Number of Fisher Scoring iterations: 2
Now, remove the variable ThreePTP (p-value of 0.3777).
glm5 <- glm(outcome ~ OREB +DREB + AST +STL +BLK + TO + FGP, data = IU_BS_conf_train)
summary(glm5)
##
## Call:
## glm(formula = outcome ~ OREB + DREB + AST + STL + BLK + TO +
## FGP, data = IU_BS_conf_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.68930 -0.23238 0.04607 0.19598 0.76797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.821227 0.418759 -6.737 4.74e-09 ***
## OREB 0.021604 0.013993 1.544 0.1274
## DREB 0.049572 0.009335 5.310 1.38e-06 ***
## AST -0.033562 0.013596 -2.468 0.0162 *
## STL 0.077769 0.016990 4.577 2.14e-05 ***
## BLK 0.031279 0.021252 1.472 0.1458
## TO -0.027660 0.013005 -2.127 0.0372 *
## FGP 0.047194 0.007654 6.166 4.77e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1252591)
##
## Null deviance: 18.4865 on 73 degrees of freedom
## Residual deviance: 8.2671 on 66 degrees of freedom
## AIC: 65.811
##
## Number of Fisher Scoring iterations: 2
Next, removal of BLK from the model (p-value of 0.1458).
glm6 <- glm(outcome ~ OREB +DREB + AST +STL + TO + FGP, data = IU_BS_conf_train)
summary(glm6)
##
## Call:
## glm(formula = outcome ~ OREB + DREB + AST + STL + TO + FGP, data = IU_BS_conf_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.68396 -0.28410 0.03347 0.22680 0.73557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.854931 0.421756 -6.769 3.91e-09 ***
## OREB 0.021101 0.014110 1.495 0.1395
## DREB 0.052555 0.009191 5.718 2.72e-07 ***
## AST -0.030042 0.013500 -2.225 0.0294 *
## STL 0.077911 0.017137 4.546 2.35e-05 ***
## TO -0.026437 0.013091 -2.019 0.0474 *
## FGP 0.047865 0.007706 6.211 3.80e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1274394)
##
## Null deviance: 18.4865 on 73 degrees of freedom
## Residual deviance: 8.5384 on 67 degrees of freedom
## AIC: 66.201
##
## Number of Fisher Scoring iterations: 2
Then, OREB needed to be removed from the model (p-value of 0.1395).
glm7 <- glm(outcome ~ DREB + AST +STL + TO + FGP, data = IU_BS_conf_train)
summary(glm7)
##
## Call:
## glm(formula = outcome ~ DREB + AST + STL + TO + FGP, data = IU_BS_conf_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.64272 -0.25725 0.01201 0.22957 0.68877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.659992 0.404736 -6.572 8.29e-09 ***
## DREB 0.054565 0.009174 5.948 1.05e-07 ***
## AST -0.025655 0.013297 -1.929 0.0579 .
## STL 0.082628 0.016996 4.861 7.20e-06 ***
## TO -0.019890 0.012449 -1.598 0.1147
## FGP 0.043254 0.007127 6.069 6.43e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1297567)
##
## Null deviance: 18.4865 on 73 degrees of freedom
## Residual deviance: 8.8235 on 68 degrees of freedom
## AIC: 66.631
##
## Number of Fisher Scoring iterations: 2
Now, TO needed to be removed from the model (p-value of 0.1147).
glm8 <- glm(outcome ~ DREB + AST +STL + FGP, data = IU_BS_conf_train)
summary(glm8)
##
## Call:
## glm(formula = outcome ~ DREB + AST + STL + FGP, data = IU_BS_conf_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.71782 -0.25344 0.00512 0.25033 0.79047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.789618 0.400958 -6.957 1.59e-09 ***
## DREB 0.053195 0.009236 5.759 2.15e-07 ***
## AST -0.026212 0.013441 -1.950 0.0552 .
## STL 0.082870 0.017186 4.822 8.17e-06 ***
## FGP 0.041966 0.007160 5.861 1.43e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1326765)
##
## Null deviance: 18.4865 on 73 degrees of freedom
## Residual deviance: 9.1547 on 69 degrees of freedom
## AIC: 67.358
##
## Number of Fisher Scoring iterations: 2
Lastly, remove the variable AST (p-value of 0.0552).
glm9 <- glm(outcome ~ DREB +STL + FGP, data = IU_BS_conf_train)
summary(glm9)
##
## Call:
## glm(formula = outcome ~ DREB + STL + FGP, data = IU_BS_conf_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.81631 -0.25910 0.03403 0.25014 0.83944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.673482 0.404371 -6.611 6.33e-09 ***
## DREB 0.047562 0.008947 5.316 1.20e-06 ***
## STL 0.082876 0.017527 4.729 1.14e-05 ***
## FGP 0.034650 0.006219 5.571 4.39e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.137989)
##
## Null deviance: 18.4865 on 73 degrees of freedom
## Residual deviance: 9.6592 on 70 degrees of freedom
## AIC: 69.328
##
## Number of Fisher Scoring iterations: 2
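After this final step, all of the remaining variables were statistically significant at the 0.05 level. Strictly speaking, because glm() was called without family = binomial, each fit above is a Gaussian (linear probability) model rather than a logistic regression, as the "gaussian family" line in each summary shows. The regression model found the following variables to be significant: DREB, STL, and FGP.
Now, to compare this model with the other models, use the testing set to evaluate how well it predicts game outcome and find the misclassification rate.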
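The manual drop-one-variable-at-a-time procedure can also be automated. The following is a minimal sketch under stated assumptions, not the code used in this study: `backward_eliminate` is a hypothetical helper, the data are simulated, and all predictors are assumed numeric so that coefficient names match column names. It also shows how passing `family = binomial()` would give a true logistic regression.

```r
# Hypothetical helper: refit the model repeatedly, each time dropping the
# predictor with the largest p-value, until all remaining p-values are
# below `alpha` (or only one predictor is left).
backward_eliminate <- function(data, response, alpha = 0.05, family = gaussian()) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- glm(reformulate(predictors, response = response),
               data = data, family = family)
    # Coefficient table without the intercept row; column 4 holds the p-values
    coefs <- summary(fit)$coefficients[-1, , drop = FALSE]
    pvals <- setNames(coefs[, 4], rownames(coefs))
    if (max(pvals) < alpha || length(predictors) == 1) return(fit)
    predictors <- setdiff(predictors, names(which.max(pvals)))
  }
}

# Simulated data (an assumption for illustration): two real effects and one noise column
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$outcome <- rbinom(n, 1, plogis(1.5 * toy$x1 - 1.5 * toy$x2))

final <- backward_eliminate(toy, "outcome", family = binomial())
names(coef(final))
```

With the study's data, the call would take the training set and `"outcome"` as arguments; categorical predictors such as location would need extra handling because their coefficient names (for example locationH) differ from the column name.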
pred_glm9_test <- predict(glm9, newdata = IU_BS_conf_test, type = "response")
table(IU_BS_conf_test$outcome, (pred_glm9_test > 0.5)*1, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 4 2
## 1 0 3
The misclassification rate of the logistic regression model is 2/9 or 22.2%.
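As a quick check on the arithmetic, the misclassification rate can be computed directly from the confusion matrix. This is a small sketch; `misclass_rate` is a hypothetical helper name, and the matrix entries are copied from the table above.

```r
# Confusion matrix from the test-set table (rows = truth, columns = prediction).
# matrix() fills column by column, so c(4, 0, 2, 3) gives rows (4, 2) and (0, 3).
conf_mat <- matrix(c(4, 0, 2, 3), nrow = 2,
                   dimnames = list(Truth = c("0", "1"), Predicted = c("0", "1")))

# Share of observations that fall off the diagonal (incorrect predictions)
misclass_rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

misclass_rate(conf_mat)   # 2/9, about 0.222
```

The diagonal shortcut assumes a square table, so it would need adjusting if a model never predicted one of the classes.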
After fitting the regression model, the analyst generated a classification tree to assess variable importance. Note that the call below fits the tree on the full conference data set, IU_BS_conf (n = 83), rather than on the training set alone. The code is displayed below:
rpart0 <- rpart(formula = outcome ~ ., data = IU_BS_conf, method = "class")
rpart0
## n= 83
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 83 39 0 (0.53012048 0.46987952)
## 2) PTS< 65.5 33 6 0 (0.81818182 0.18181818)
## 4) DREB< 29 26 2 0 (0.92307692 0.07692308) *
## 5) DREB>=29 7 3 1 (0.42857143 0.57142857) *
## 3) PTS>=65.5 50 17 1 (0.34000000 0.66000000)
## 6) opponent=Illinois,Iowa,Maryland,Michigan State,Northwestern,Purdue,Rutgers,Wisconsin 29 14 0 (0.51724138 0.48275862)
## 12) PF>=20.5 8 1 0 (0.87500000 0.12500000) *
## 13) PF< 20.5 21 8 1 (0.38095238 0.61904762)
## 26) DREB< 24.5 10 4 0 (0.60000000 0.40000000) *
## 27) DREB>=24.5 11 2 1 (0.18181818 0.81818182) *
## 7) opponent=Michigan,Minnesota,Nebraska,Ohio State,Penn State 21 2 1 (0.09523810 0.90476190) *
prp(rpart0, digits = 4, extra = 1)
When the classification tree was generated, the following variables were used in its splits and are therefore denoted as important in relation to outcome: PTS, DREB, opponent, and PF.
This model has one variable (DREB) in common with the logistic regression model. The analyst then compared model accuracy by generating the misclassification rate for the classification tree.
pred0 <- predict(rpart0, IU_BS_conf_test, type = "class")
#table representing the number of predictions matched correctly with the testing set
table(IU_BS_conf_test$outcome, pred0, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 6 0
## 1 0 3
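The tree classified all nine test observations correctly (9/9). This perfect score is not a reliable measure of accuracy, because the tree was fit on the full data set (n = 83), which includes the test observations. Since the analyst is using the tree only to gauge variable importance rather than to predict, this is not pursued further.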
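For an honest test-set error, the tree would need to be fit on the training rows only. The sketch below illustrates that pattern on simulated data; the column names mirror the report, but the values, the split, and the object names are assumptions for illustration.

```r
library(rpart)

# Simulated stand-in for the box score data (not the study's data)
set.seed(42)
n <- 120
toy <- data.frame(PTS  = round(rnorm(n, 68, 9)),
                  DREB = round(rnorm(n, 25, 4)))
toy$outcome <- factor(as.integer(toy$PTS + 2 * toy$DREB + rnorm(n, 0, 8) > 118))

# Hold out 20 rows; the tree never sees them during fitting
test_idx <- sample(n, 20)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

tree <- rpart(outcome ~ ., data = train, method = "class")
pred <- predict(tree, test, type = "class")

mean(pred != test$outcome)   # misclassification rate on unseen rows
tree$variable.importance     # importance scores stored on the fitted tree
```

In the report's terms, this corresponds to passing IU_BS_conf_train as the data argument and predicting on IU_BS_conf_test.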
After completing the classification tree, the analyst then looked to find variable significance using random forests.
First, change any categorical variables into factors.
IU_BS_conf_train$outcome <- as.factor(IU_BS_conf_train$outcome)
IU_BS_conf_test$outcome <- as.factor(IU_BS_conf_test$outcome)
IU_BS_conf_train$opponent <- as.factor(IU_BS_conf_train$opponent)
IU_BS_conf_test$opponent <- as.factor(IU_BS_conf_test$opponent)
IU_BS_conf_train$location <- as.factor(IU_BS_conf_train$location)
IU_BS_conf_test$location <- as.factor(IU_BS_conf_test$location)
Now, the random forest model can be generated.
rf0 <- randomForest(outcome~., data = IU_BS_conf_train, importance = TRUE)
rf0
##
## Call:
## randomForest(formula = outcome ~ ., data = IU_BS_conf_train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 32.43%
## Confusion matrix:
## 0 1 class.error
## 0 21 17 0.4473684
## 1 7 29 0.1944444
After the model has been created, variable importance can be found using the following code. The results rank the variables by importance, much as the p-values did for the regression model earlier. Note that the variables with the highest MeanDecreaseAccuracy are the most important.
rf0$importance
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## OREB -0.0045716768 -0.0031017532 -0.0034256084 1.330370
## DREB 0.0198105896 0.0263716827 0.0233921829 4.014946
## AST 0.0059317373 0.0032623310 0.0045329569 1.819215
## STL 0.0061118361 0.0101952693 0.0084751543 2.518112
## BLK -0.0003306202 0.0008541847 -0.0003431658 1.644625
## TO 0.0013874397 -0.0031745179 -0.0009305782 1.416572
## PF -0.0038763941 -0.0013647694 -0.0023461074 1.368019
## PTS 0.0351203506 0.0316189009 0.0324080885 4.650948
## POSS 0.0007125838 0.0054537647 0.0024841964 1.878996
## opponent 0.0067425759 0.0165795212 0.0112263215 6.988677
## location 0.0002209774 0.0134348753 0.0061603603 1.484484
## FGP 0.0248471299 0.0193728572 0.0216079159 3.553630
## ThreePTP -0.0020214087 0.0068091265 0.0030250211 2.212006
## FTP -0.0051409723 -0.0020567914 -0.0039221743 1.609345
Ranked by MeanDecreaseAccuracy, the variables found to be most important in relation to game outcome for the random forest model are: PTS, DREB, FGP, opponent, STL, and location.
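Reading the ranking off the raw importance matrix is easier once it is sorted. The sketch below rebuilds a subset of the matrix by hand from the output above (seven of the fourteen rows, two of the four columns) and orders it by MeanDecreaseAccuracy; with the fitted model available, `rf0$importance` would take the place of the hand-built `imp`.

```r
# Subset of the importance values printed above, entered by hand
imp <- matrix(
  c( 0.0233921829, 4.014946,
     0.0084751543, 2.518112,
     0.0324080885, 4.650948,
     0.0112263215, 6.988677,
     0.0061603603, 1.484484,
     0.0216079159, 3.553630,
    -0.0034256084, 1.330370),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("DREB", "STL", "PTS", "opponent", "location", "FGP", "OREB"),
                  c("MeanDecreaseAccuracy", "MeanDecreaseGini")))

# Sort rows from most to least important by MeanDecreaseAccuracy
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
rownames(ranked)
# "PTS" "DREB" "FGP" "opponent" "STL" "location" "OREB"
```

Sorting by MeanDecreaseGini instead would put opponent first, so the choice of measure affects the ranking.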
Once again, some variables recur from the earlier models, while others are newly identified by this model. Now, the analyst will use the testing set to find the misclassification rate.
#a trick to ensure that the levels of the training set and testing set are the same to avoid any error clauses
IU_BS_conf_test <- rbind(IU_BS_conf_train[1, ] , IU_BS_conf_test)
IU_BS_conf_test <- IU_BS_conf_test[-1,]
rf_pred <- predict(rf0, IU_BS_conf_test)
#shows how our predictions match up against the actual values.
table(IU_BS_conf_test$outcome, rf_pred, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 5 1
## 1 0 3
This model has a misclassification rate of 1/9, or about 11.1%. That is better than the logistic regression model (22.2%) but worse than the classification tree (0%).
DREB was the only variable found to be significant for game outcome in all three models. A number of other variables were found to be important in at least two models: FGP, STL, PTS, and opponent. Two variables were found to be significant in just one of the three models: PF and location.
The analyst is not limited to the variables in the list above, but they will be the general focus as comparisons across coaches are made.
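The rbind trick used above to align factor levels between the training and testing sets can also be written more directly. This is a sketch of one alternative, using made-up opponent values; it sets the test factor's levels explicitly from the training factor.

```r
# Toy data (assumed values): the test set contains only some of the training levels
train <- data.frame(opponent = factor(c("Iowa", "Purdue", "Rutgers")))
test  <- data.frame(opponent = c("Purdue", "Iowa"), stringsAsFactors = FALSE)

# Rebuild the test factor with exactly the training levels, in the same order
test$opponent <- factor(test$opponent, levels = levels(train$opponent))

identical(levels(test$opponent), levels(train$opponent))   # TRUE
```

One caveat of this approach is that any test value absent from the training levels becomes NA, so it is worth checking for missing values afterwards.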
First, the analyst needed to format the data so it presented box score information for the duration of each coach.
# Big Ten opponents; matching with %in% replaces the long chain of == comparisons
big_ten <- c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State", "Minnesota", "Nebraska", "Northwestern", "Ohio State", "Penn State", "Purdue", "Rutgers", "Wisconsin")
IU_BS_conf <- filter(IU_TeamBS, opponent %in% big_ten)
Remake the shooting percentages.
IU_BS_conf$FGP <- (IU_BS_conf$FGM / IU_BS_conf$FGA) * 100
IU_BS_conf$ThreePTP <- (IU_BS_conf$ThreePTM / IU_BS_conf$ThreePTA) * 100
IU_BS_conf$FTP <- (IU_BS_conf$FTM / IU_BS_conf$FTA) * 100
Now, remove the unnecessary columns from the data set.
IU_BS_conf <- IU_BS_conf[,-22]
IU_BS_conf <- IU_BS_conf[,-21]
IU_BS_conf <- IU_BS_conf[,-20]
IU_BS_conf <- IU_BS_conf[,-(2:10)]
The data were then split to match the seasons each coach led the team: 2019-20 and 2020-21 for Archie Miller, and 2021-22 and 2022-23 for Mike Woodson.
IU_BS_confAM <- filter(IU_BS_conf, season == "2020" | season == "2021")
IU_BS_confMW <- filter(IU_BS_conf, season == "2022" | season == "2023")
Now, each of these was aggregated for comparison, starting with the two-season span for Archie Miller.
summary(IU_BS_confAM)
## season OREB DREB REB AST
## Min. :2020 Min. : 2.0 Min. :14.00 Min. :21.00 Min. : 6.00
## 1st Qu.:2020 1st Qu.: 8.0 1st Qu.:21.00 1st Qu.:28.75 1st Qu.: 9.75
## Median :2020 Median :10.0 Median :25.00 Median :35.00 Median :13.00
## Mean :2020 Mean :10.1 Mean :24.85 Mean :34.95 Mean :12.53
## 3rd Qu.:2021 3rd Qu.:12.0 3rd Qu.:27.00 3rd Qu.:39.00 3rd Qu.:15.00
## Max. :2021 Max. :19.0 Max. :36.00 Max. :54.00 Max. :22.00
## STL BLK TO PF
## Min. : 0.000 Min. :0.000 Min. : 6.00 Min. :10.00
## 1st Qu.: 3.750 1st Qu.:2.000 1st Qu.: 9.00 1st Qu.:14.00
## Median : 6.000 Median :3.000 Median :12.00 Median :17.50
## Mean : 5.425 Mean :3.475 Mean :11.85 Mean :17.48
## 3rd Qu.: 7.000 3rd Qu.:5.000 3rd Qu.:14.25 3rd Qu.:19.25
## Max. :11.000 Max. :8.000 Max. :17.00 Max. :28.00
## PTS opponent location outcome
## Min. :49.00 Length:40 Length:40 Min. :0.0
## 1st Qu.:59.00 Class :character Class :character 1st Qu.:0.0
## Median :66.50 Mode :character Mode :character Median :0.0
## Mean :67.45 Mean :0.4
## 3rd Qu.:72.25 3rd Qu.:1.0
## Max. :96.00 Max. :1.0
## FGP ThreePTP FTP
## Min. :25.42 Min. :10.00 Min. :40.00
## 1st Qu.:37.23 1st Qu.:21.33 1st Qu.:60.83
## Median :42.30 Median :33.33 Median :68.20
## Mean :42.18 Mean :33.01 Mean :67.58
## 3rd Qu.:45.96 3rd Qu.:41.11 3rd Qu.:76.52
## Max. :57.78 Max. :62.50 Max. :90.00
Next, the information was derived by each season Archie Miller was head coach.
aggregate(cbind(OREB, DREB, AST, STL, BLK, TO, PF, PTS, FGP, ThreePTP, FTP) ~ season, data = IU_BS_confAM, FUN = mean, na.rm = TRUE)
## season OREB DREB AST STL BLK TO PF PTS FGP ThreePTP FTP
## 1 2020 11.05 25.3 11.75 5.10 3.80 12.2 17.15 66.45 41.78453 32.68554 68.42506
## 2 2021 9.15 24.4 13.30 5.75 3.15 11.5 17.80 68.45 42.57464 33.34246 66.74194
These values will be listed in an organized set of tables under the conclusion tab.
Now, the same was done for Mike Woodson. First, starting with the overall averages for both seasons.
summary(IU_BS_confMW)
## season OREB DREB REB
## Min. :2022 Min. : 2.000 Min. :14.00 Min. :20.00
## 1st Qu.:2022 1st Qu.: 7.000 1st Qu.:22.50 1st Qu.:31.00
## Median :2022 Median : 9.000 Median :25.00 Median :34.00
## Mean :2022 Mean : 8.791 Mean :25.44 Mean :34.23
## 3rd Qu.:2023 3rd Qu.:10.500 3rd Qu.:29.00 3rd Qu.:39.00
## Max. :2023 Max. :15.000 Max. :35.00 Max. :45.00
## AST STL BLK TO
## Min. : 6.00 Min. : 1.000 Min. : 1.000 Min. : 3.00
## 1st Qu.:11.00 1st Qu.: 4.000 1st Qu.: 3.000 1st Qu.: 9.00
## Median :14.00 Median : 5.000 Median : 4.000 Median :10.00
## Mean :13.91 Mean : 5.302 Mean : 4.442 Mean :10.81
## 3rd Qu.:16.00 3rd Qu.: 7.000 3rd Qu.: 6.000 3rd Qu.:13.00
## Max. :22.00 Max. :11.000 Max. :10.000 Max. :23.00
## PF PTS opponent location
## Min. :10.0 Min. :48.00 Length:43 Length:43
## 1st Qu.:15.0 1st Qu.:62.00 Class :character Class :character
## Median :18.0 Median :68.00 Mode :character Mode :character
## Mean :17.3 Mean :68.81
## 3rd Qu.:20.0 3rd Qu.:74.50
## Max. :25.0 Max. :89.00
## outcome FGP ThreePTP FTP
## Min. :0.0000 Min. :30.36 Min. :12.50 Min. : 46.15
## 1st Qu.:0.0000 1st Qu.:40.98 1st Qu.:26.79 1st Qu.: 66.67
## Median :1.0000 Median :45.16 Median :31.58 Median : 71.43
## Mean :0.5349 Mean :45.33 Mean :34.62 Mean : 71.98
## 3rd Qu.:1.0000 3rd Qu.:49.44 3rd Qu.:40.83 3rd Qu.: 78.17
## Max. :1.0000 Max. :61.82 Max. :76.92 Max. :100.00
Then, the aggregation by season was done to find season averages.
aggregate(cbind(OREB, DREB, AST, STL, BLK, TO, PF, PTS, FGP, ThreePTP, FTP) ~ season, data = IU_BS_confMW, FUN = mean, na.rm = TRUE)
## season OREB DREB AST STL BLK TO PF
## 1 2022 8.695652 25.34783 14.13043 5.652174 4.434783 10.04348 17.13043
## 2 2023 8.900000 25.55000 13.65000 4.900000 4.450000 11.70000 17.50000
## PTS FGP ThreePTP FTP
## 1 67.82609 43.97354 32.85037 71.89765
## 2 69.95000 46.89440 36.66028 72.07123
These values will be listed in an organized set of tables under the conclusion tab.
This information was then all stored in a table for easy comparison between the two coaches and their respective season averages.
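One way to build such a comparison table in a single step is to label each game with its coach and aggregate by that label. The sketch below uses a handful of made-up rows standing in for IU_BS_conf; only the column names mirror the report, and the `coach` column is an assumption introduced for illustration.

```r
# Toy rows standing in for IU_BS_conf (values are invented)
games <- data.frame(
  season = c(2020, 2021, 2022, 2023, 2022, 2020),
  DREB   = c(25, 24, 26, 27, 24, 23),
  FGP    = c(41.0, 43.0, 44.0, 47.0, 45.0, 42.0)
)

# Label each game by coach based on the season it was played in
games$coach <- ifelse(games$season %in% c(2020, 2021), "Miller", "Woodson")

# One row per coach, one column per averaged statistic
by_coach <- aggregate(cbind(DREB, FGP) ~ coach, data = games, FUN = mean)
by_coach
```

Replacing `coach` with `coach + season` in the formula would give the per-season averages in the same layout.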
First, look at the variables identified as important to game outcome and compare these values:
Other notable observations are:
The analyst set out to break the data down by date to see whether the coaches showed general trends over the course of the season and throughout conference play, using the variables identified as important in the earlier analysis. All of these variables were examined as monthly averages, with November as the first month of play. The visuals were all created in Tableau, and screenshots are included.
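The monthly averages behind the visuals were produced in Tableau, but an equivalent calculation can be sketched in R. The example below is illustrative only: the `date` column and its values are assumptions, since the report does not show how game dates are stored.

```r
# Toy games with a date column (assumed name and values)
games <- data.frame(
  date = as.Date(c("2022-12-08", "2022-12-20", "2023-01-05",
                   "2023-01-19", "2023-02-11")),
  DREB = c(24, 28, 26, 22, 30)
)

# Order months as they occur in a season (November first), not by calendar number
season_months <- c(11, 12, 1, 2, 3)
games$month <- factor(as.integer(format(games$date, "%m")),
                      levels = season_months,
                      labels = c("Nov", "Dec", "Jan", "Feb", "Mar"))

# Average defensive rebounds per month; months with no games are omitted
monthly <- aggregate(DREB ~ month, data = games, FUN = mean)
monthly
```

Using the month number rather than the month name keeps the result independent of the system locale.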
First, start with Coach Archie Miller:
Over the last two seasons that Archie Miller was head coach, the following was observed:
Now, look at Coach Mike Woodson:
Over the last two seasons that Mike Woodson has been head coach, the following was observed:
In short, Coach Woodson’s teams appear to have a higher defensive rebound average throughout the season than Coach Miller’s. However, the two coaches’ trends run in opposite directions. The visuals show that Miller’s teams (starting from a lower average) generally increase their number of defensive rebounds, while Woodson’s teams (starting from a higher average) generally decrease theirs throughout conference play. Lastly, the two coaches have about the same minimum monthly average for defensive rebounds, but Woodson has a higher maximum monthly average.
First, start with Coach Archie Miller:
Over the last two seasons that Archie Miller was head coach, the following was observed:
Now, look at Coach Mike Woodson:
Over the last two seasons that Mike Woodson has been head coach, the following was observed:
In short, while both coaches led the team to about the same best and worst monthly field goal percentages, the timing of each differs. Miller’s teams in his last two seasons saw an overall decrease in field goal percentage over the course of the season, while Woodson’s teams generally saw an increase.
First, start with Coach Archie Miller:
Over the last two seasons that Archie Miller was head coach, the following was observed:
Now, look at Coach Mike Woodson:
Over the last two seasons that Mike Woodson has been head coach, the following was observed:
In short, while both coaches increase the number of steals as conference play continues, Miller’s teams in his last two seasons seem to be more consistent in this area of play. However, Woodson’s teams in the month of March far outdo any of the monthly averages that Miller’s teams posted.
After looking at general trends over the course of conference play, the analyst derived opponent-specific information. As before, averages were computed per coach and per season.
Each of these was aggregated for comparison, starting with the two-season span for Archie Miller.
AM_conf_avg <- aggregate(cbind(OREB, DREB, AST, STL, TO, FGP, ThreePTP) ~ opponent, data = IU_BS_confAM, FUN = mean, na.rm = TRUE)
AM_conf_avg
## opponent OREB DREB AST STL TO FGP
## 1 Illinois 8.666667 26.33333 11.00000 5.333333 11.00000 40.35796
## 2 Iowa 13.333333 25.00000 15.33333 8.333333 11.33333 43.63191
## 3 Maryland 12.000000 27.33333 13.00000 2.333333 10.00000 41.99510
## 4 Michigan 7.000000 17.00000 7.50000 2.500000 9.00000 42.18159
## 5 Michigan State 10.000000 22.33333 11.66667 7.333333 9.00000 40.73365
## 6 Minnesota 9.000000 27.00000 14.66667 4.333333 12.33333 51.02323
## 7 Nebraska 14.333333 33.33333 16.33333 3.666667 13.00000 48.01078
## 8 Northwestern 13.000000 26.00000 13.00000 8.333333 15.00000 40.19928
## 9 Ohio State 7.666667 21.33333 11.00000 6.000000 13.00000 42.68481
## 10 Penn State 8.000000 27.00000 10.33333 6.000000 14.00000 44.35626
## 11 Purdue 9.250000 20.50000 12.00000 5.750000 12.00000 37.72054
## 12 Rutgers 10.250000 24.25000 12.25000 5.750000 12.50000 37.02235
## 13 Wisconsin 8.000000 24.66667 13.33333 3.666667 10.66667 41.62329
## ThreePTP
## 1 46.29630
## 2 41.84224
## 3 29.25749
## 4 25.83333
## 5 21.46199
## 6 35.63492
## 7 29.42308
## 8 31.21693
## 9 47.22222
## 10 28.49168
## 11 23.14312
## 12 32.49269
## 13 37.93651
Next, the information was derived by each season Archie Miller was head coach.
AM_conf_avg_by_season <- aggregate(cbind(OREB, DREB, AST, STL, TO, FGP, ThreePTP) ~ opponent + season, data = IU_BS_confAM, FUN = mean, na.rm = TRUE)
AM_conf_avg_by_season
## opponent season OREB DREB AST STL TO FGP
## 1 Illinois 2020 12.0 27.00000 12.00000 4.000000 10.00000 40.67797
## 2 Iowa 2020 16.0 23.00000 16.00000 11.000000 17.00000 45.90164
## 3 Maryland 2020 12.0 25.50000 14.50000 1.500000 10.00000 44.34858
## 4 Michigan 2020 7.0 14.00000 7.00000 1.000000 7.00000 45.90164
## 5 Michigan State 2020 10.0 21.00000 12.00000 6.000000 8.00000 45.61404
## 6 Minnesota 2020 8.5 28.50000 14.50000 5.500000 10.00000 47.64595
## 7 Nebraska 2020 15.5 35.50000 17.50000 3.500000 15.50000 48.93925
## 8 Northwestern 2020 15.0 25.00000 11.00000 9.000000 16.00000 37.03704
## 9 Ohio State 2020 6.0 23.50000 10.00000 7.500000 12.00000 43.02721
## 10 Penn State 2020 10.0 29.50000 8.00000 6.500000 14.50000 37.96296
## 11 Purdue 2020 12.0 20.50000 10.50000 4.500000 13.50000 34.28049
## 12 Rutgers 2020 11.0 26.00000 6.00000 4.000000 16.00000 31.66667
## 13 Wisconsin 2020 11.0 22.00000 10.50000 4.500000 9.50000 38.24138
## 14 Illinois 2021 7.0 26.00000 10.50000 6.000000 11.50000 40.19796
## 15 Iowa 2021 12.0 26.00000 15.00000 7.000000 8.50000 42.49705
## 16 Maryland 2021 12.0 31.00000 10.00000 4.000000 10.00000 37.28814
## 17 Michigan 2021 7.0 20.00000 8.00000 4.000000 11.00000 38.46154
## 18 Michigan State 2021 10.0 23.00000 11.50000 8.000000 9.50000 38.29346
## 19 Minnesota 2021 10.0 24.00000 15.00000 2.000000 17.00000 57.77778
## 20 Nebraska 2021 12.0 29.00000 14.00000 4.000000 8.00000 46.15385
## 21 Northwestern 2021 12.0 26.50000 14.00000 8.000000 14.50000 41.78040
## 22 Ohio State 2021 11.0 17.00000 13.00000 3.000000 15.00000 42.00000
## 23 Penn State 2021 4.0 22.00000 15.00000 5.000000 13.00000 57.14286
## 24 Purdue 2021 6.5 20.50000 13.50000 7.000000 10.50000 41.16059
## 25 Rutgers 2021 10.0 23.66667 14.33333 6.333333 11.33333 38.80757
## 26 Wisconsin 2021 2.0 30.00000 19.00000 2.000000 13.00000 48.38710
## ThreePTP
## 1 50.00000
## 2 52.38095
## 3 34.79532
## 4 25.00000
## 5 33.33333
## 6 24.28571
## 7 25.38462
## 8 21.42857
## 9 54.16667
## 10 26.94805
## 11 27.08333
## 12 10.52632
## 13 37.85714
## 14 44.44444
## 15 36.57289
## 16 18.18182
## 17 26.66667
## 18 15.52632
## 19 58.33333
## 20 37.50000
## 21 36.11111
## 22 33.33333
## 23 31.57895
## 24 19.20290
## 25 39.81481
## 26 38.09524
These values will be listed in an organized set of tables under the conclusion tab.
Now, the same was done for Mike Woodson. First, starting with the overall averages for both seasons.
MW_conf_avg <- aggregate(cbind(OREB, DREB, AST, STL, TO, FGP, ThreePTP) ~ opponent, data = IU_BS_confMW, FUN = mean, na.rm = TRUE)
MW_conf_avg
## opponent OREB DREB AST STL TO FGP
## 1 Illinois 9.500000 27.50000 13.75000 5.000000 12.000000 46.97511
## 2 Iowa 10.000000 21.50000 17.75000 6.750000 14.250000 48.97580
## 3 Maryland 7.333333 28.00000 14.33333 4.333333 11.000000 46.80260
## 4 Michigan 7.500000 24.25000 14.00000 6.500000 10.250000 43.58259
## 5 Michigan State 8.666667 21.33333 12.33333 5.000000 11.000000 43.05701
## 6 Minnesota 7.666667 32.00000 16.66667 3.333333 8.666667 48.28042
## 7 Nebraska 8.000000 27.33333 14.00000 7.333333 14.666667 49.22807
## 8 Northwestern 7.333333 29.33333 14.00000 2.000000 12.000000 45.84628
## 9 Ohio State 13.666667 25.66667 14.33333 6.000000 9.666667 41.78620
## 10 Penn State 10.000000 22.33333 14.33333 5.000000 8.666667 44.90112
## 11 Purdue 5.750000 20.75000 11.25000 7.500000 6.500000 46.92774
## 12 Rutgers 9.000000 24.00000 11.00000 5.666667 12.333333 39.08730
## 13 Wisconsin 10.666667 29.33333 12.66667 3.000000 9.666667 42.15583
## ThreePTP
## 1 32.96620
## 2 32.41228
## 3 31.91142
## 4 36.57895
## 5 39.84127
## 6 37.89683
## 7 37.04429
## 8 34.09091
## 9 31.63743
## 10 43.00797
## 11 35.41667
## 12 31.41270
## 13 26.24644
Then, the aggregation by season was done to find season averages.
MW_conf_avg_by_season <- aggregate(cbind(OREB, DREB, AST, STL, TO, FGP, ThreePTP) ~ opponent + season, data = IU_BS_confMW, FUN = mean, na.rm = TRUE)
MW_conf_avg_by_season
## opponent season OREB DREB AST STL TO FGP ThreePTP
## 1 Illinois 2022 7.0 27.0 13.5 4.0 8.5 41.07143 26.53846
## 2 Iowa 2022 12.5 21.0 19.5 6.5 16.5 49.28122 29.06699
## 3 Maryland 2022 5.0 28.0 16.0 5.5 10.5 51.45390 34.23077
## 4 Michigan 2022 8.0 21.0 16.5 6.5 10.0 42.88642 39.82456
## 5 Michigan State 2022 12.0 22.0 14.0 6.0 11.0 33.89831 23.80952
## 6 Minnesota 2022 6.5 30.5 16.5 4.0 8.0 51.88492 42.55952
## 7 Nebraska 2022 8.5 27.0 10.0 7.0 14.5 47.17544 33.56643
## 8 Northwestern 2022 6.0 28.0 13.0 3.0 7.0 37.03704 25.00000
## 9 Ohio State 2022 13.0 26.5 13.0 6.5 10.0 37.67930 22.45614
## 10 Penn State 2022 8.5 21.0 14.0 7.0 7.5 45.31778 50.22624
## 11 Purdue 2022 6.0 26.0 11.5 7.5 6.0 43.09524 27.50000
## 12 Rutgers 2022 9.0 22.0 9.0 7.0 9.0 41.07143 28.57143
## 13 Wisconsin 2022 11.5 27.5 14.0 2.5 10.5 39.84664 33.11966
## 14 Illinois 2023 12.0 28.0 14.0 6.0 15.5 52.87879 39.39394
## 15 Iowa 2023 7.5 22.0 16.0 7.0 12.0 48.67037 35.75758
## 16 Maryland 2023 12.0 28.0 11.0 2.0 12.0 37.50000 27.27273
## 17 Michigan 2023 7.0 27.5 11.5 6.5 10.5 44.27876 33.33333
## 18 Michigan State 2023 7.0 21.0 11.5 4.5 11.0 47.63636 47.85714
## 19 Minnesota 2023 10.0 35.0 17.0 2.0 10.0 41.07143 28.57143
## 20 Nebraska 2023 7.0 28.0 22.0 8.0 15.0 53.33333 44.00000
## 21 Northwestern 2023 8.0 30.0 14.5 1.5 14.5 50.25090 38.63636
## 22 Ohio State 2023 15.0 24.0 17.0 5.0 9.0 50.00000 50.00000
## 23 Penn State 2023 13.0 25.0 15.0 1.0 11.0 44.06780 28.57143
## 24 Purdue 2023 5.5 15.5 11.0 7.5 7.0 50.76023 43.33333
## 25 Rutgers 2023 9.0 25.0 12.0 5.0 14.0 38.09524 32.83333
## 26 Wisconsin 2023 9.0 33.0 10.0 4.0 8.0 46.77419 12.50000
These values will be listed in an organized set of tables under the conclusion tab.
Conference Opponent Average Conclusions
Overall, the analyst found that against 10 of the 13 conference opponents, the team posted better averages in more of the game components (OREB, DREB, AST, STL, TO, FGP, and ThreePTP) during Woodson’s first two years than during Archie Miller’s last two years. For example, the Hoosiers under Woodson had a higher average field goal percentage than the Hoosiers under Miller against 11 of their 13 conference opponents. Evidence like this suggests that Woodson has the program moving in the right direction.
The analyst also compared the shooting percentages between Woodson’s first and second years as head coach and found that field goal percentage and three-point percentage each improved against 8 of the 13 conference opponents. Not only is Woodson improving on Miller’s averages, but he also appears to have improved on his own.
For more details on the relationship between the team’s performance under a given coach and a specific opponent, navigate through the tabs below:
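The field goal percentage counts quoted above can be checked directly from the per-opponent tables. The sketch below copies the FGP columns from the AM_conf_avg, MW_conf_avg, and MW_conf_avg_by_season outputs by hand (opponents in alphabetical order, as printed) and counts the opponents against which the figure improved; the vector names are hypothetical.

```r
opponents <- c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State",
               "Minnesota", "Nebraska", "Northwestern", "Ohio State",
               "Penn State", "Purdue", "Rutgers", "Wisconsin")

# Two-season FGP averages per opponent, copied from the tables above
fgp_miller  <- c(40.35796, 43.63191, 41.99510, 42.18159, 40.73365, 51.02323,
                 48.01078, 40.19928, 42.68481, 44.35626, 37.72054, 37.02235,
                 41.62329)
fgp_woodson <- c(46.97511, 48.97580, 46.80260, 43.58259, 43.05701, 48.28042,
                 49.22807, 45.84628, 41.78620, 44.90112, 46.92774, 39.08730,
                 42.15583)

# Woodson's per-season FGP averages per opponent
fgp_2022 <- c(41.07143, 49.28122, 51.45390, 42.88642, 33.89831, 51.88492,
              47.17544, 37.03704, 37.67930, 45.31778, 43.09524, 41.07143,
              39.84664)
fgp_2023 <- c(52.87879, 48.67037, 37.50000, 44.27876, 47.63636, 41.07143,
              53.33333, 50.25090, 50.00000, 44.06780, 50.76023, 38.09524,
              46.77419)

sum(fgp_woodson > fgp_miller)          # 11 opponents: Woodson higher than Miller
opponents[fgp_woodson <= fgp_miller]   # "Minnesota" "Ohio State"
sum(fgp_2023 > fgp_2022)               # 8 opponents: Woodson's year two higher than year one
```

The same comparison applied to the ThreePTP columns would reproduce the three-point count.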
Illinois
The table for average difference between each coach and their respective seasons is pasted below:
To start, the analyst looked at the difference in the overall averages between coaches.
Overall aspects of the game against Illinois that were better for Archie Miller than for Mike Woodson include:
Overall aspects of the game against Illinois that are better for Mike Woodson than for Archie Miller include:
The analyst then looked at Coach Woodson’s averages and compared his first season against Illinois to his second. From this, it is clear that under Coach Woodson’s guidance the Hoosiers improved nearly all aspects of the game against Illinois. The only one that didn’t improve was the number of turnovers (8.5 to 15.5). More impressive is the increase in both shooting percentages by more than 10 percentage points (field goal percentage up 11.8 and three-point percentage up 12.9). It appears that Woodson has a strong game plan against Illinois.
Iowa
The table for average difference between each coach and their respective seasons is pasted below:
To start, the analyst looked at the difference in the overall averages between coaches.
Overall aspects of the game against Iowa that were better for Archie Miller than for Mike Woodson include:
Overall aspects of the game against Iowa that are better for Mike Woodson than for Archie Miller include:
The analyst then looked at Coach Woodson’s averages and compared his first season against Iowa to his second. From this, it is clear that under Coach Woodson’s guidance the Hoosiers improved some aspects of the game against Iowa. These areas included defensive rebounds, steals, turnovers, and three-point percentage. While field goal percentage didn’t increase, it stayed about the same under Woodson.
Maryland
The table for average difference between each coach and their respective seasons is pasted below:
To start, the analyst looked at the difference in the overall averages between coaches.
Overall aspects of the game against Maryland that were better for Archie Miller than for Mike Woodson include:
Overall aspects of the game against Maryland that are better for Mike Woodson than for Archie Miller include:
The analyst then looked at Coach Woodson’s averages and compared his first season against Maryland to his second. From this, it is clear that Coach Woodson has struggled against this opponent in some areas of play. In particular, large decreases can be found in the shooting percentages (field goal percentage down about 14 percentage points and three-point percentage down about 7).
Michigan
The table for average difference between each coach and their respective seasons is pasted below:
To start, the analyst looked at the difference in the overall averages between coaches.
Overall aspects of the game against Michigan that were better for Archie Miller than for Mike Woodson include:
Overall aspects of the game against Michigan that are better for Mike Woodson than for Archie Miller include:
The analyst then looked at Coach Woodson’s averages and compared his first season against Michigan to his second. The analyst found that while individual values rose or fell, most stayed at about the same level. For example, offensive rebounds averaged 8.0 in his first year and 7.0 in his second. Similarly, field goal percentage was 42.9% in his first year and 44.3% in his second. An area of notable improvement is defensive rebounds, which moved from 21.0 to 27.5. Overall, Woodson is developing against Michigan.
Michigan State
The table for average difference between each coach and their respective seasons is pasted below:
To start, the analyst looked at the difference in the overall averages between coaches.
Overall aspects of the game against Michigan State that were better for Archie Miller than for Mike Woodson include:
Overall aspects of the game against Michigan State that are better for Mike Woodson than for Archie Miller include:
The analyst then looked at Coach Woodson’s averages and compared his first season against Michigan State to his second. The analyst found that while many values decreased, the team’s shooting percentages actually increased over his two years. These increases were not small: both rose by more than 10 percentage points. As with Maryland, though, it appears that Woodson struggled against Michigan State in most aspects of the game.
Minnesota
The table for average difference between each coach and their respective seasons is pasted below:
To start, the analyst looked at the difference in the overall averages between coaches.
Overall aspects of the game against Minnesota that were better for Archie Miller than for Mike Woodson include:
Overall aspects of the game against Minnesota that are better for Mike Woodson than for Archie Miller include:
The analyst then looked at Coach Woodson’s averages and compared his first season against Minnesota to his second. The analyst found a general increase in all variables except the shooting percentages, turnovers, and steals. While the differences in steals and turnovers were small, both shooting percentages fell by more than 10 percentage points (field goal percentage down 10.8 and three-point percentage down 14.0). Even so, Woodson has still amassed a 3-0 record against Minnesota, so this is not as concerning.
Nebraska
The table for average difference between each coach and their respective seasons is pasted below:
To start, the analyst looked at the difference in the overall averages between coaches.
Overall aspects of the game against Nebraska that were better for Archie Miller than for Mike Woodson include:
Overall aspects of the game against Nebraska that are better for Mike Woodson than for Archie Miller include:
The analyst then looked at Coach Woodson’s averages and compared his first season against Nebraska to his second. The analyst found that nearly all components of the game improved from his first season to his second season. The largest changes came from assists (up 12) and three-point percentage (up 10%). Both coaches have a 3-0 record against this team. Thus, any areas where Miller posted better values are less of a concern, as Woodson’s teams have still been effective against Nebraska.
Northwestern
The table for average difference between each coach and their respective seasons is pasted below:
To start, the analyst looked at the difference in the overall averages between coaches.
Overall aspects of the game against Northwestern that were better for Archie Miller than for Mike Woodson include:
Overall aspects of the game against Northwestern that are better for Mike Woodson than for Archie Miller include:
The analyst then looked at Coach Woodson’s averages and compared his first season against Northwestern to his second. The analyst found that nearly all components of the game improved from his first season as coach to his second season as coach. The largest changes came from average field goal percentage (up 13.3%) and three-point percentage (up 13.6%). Given this, Woodson is improving the level of play against Northwestern.
Ohio State
The table for average difference between each coach and their respective seasons is pasted below:
To start, the analyst looked at the difference in the overall averages between coaches.
Overall aspects of the game against Ohio State that were better for Archie Miller than for Mike Woodson include:
Overall aspects of the game against Ohio State that are better for Mike Woodson than for Archie Miller include:
The analyst then looked at Coach Woodson’s averages and compared his first season against Ohio State to his second. The analyst found that nearly all components of the game improved from his first season as coach to his second season as coach. The largest changes came from average field goal percentage (up 12.3%) and three-point percentage (up 27.5%). While his first year against Ohio State may have been rough, Woodson appears to have found an approach in his second season to help his team be more successful.
Penn State
The table for average difference between each coach and their respective seasons is pasted below:
To start, the analyst looked at the difference in the overall averages between coaches.
Overall aspects of the game against Penn State that were better for Archie Miller than for Mike Woodson include:
Overall aspects of the game against Penn State that are better for Mike Woodson than for Archie Miller include:
The analyst then looked at Coach Woodson’s averages and compared his first season against Penn State to his second. The analyst found that some components of the game improved, while others fell short of the season before. The largest changes were an improvement in average offensive rebounds (up 4) and a decrease in three-point percentage of about 21.6%. However, even with the large decrease in three-point percentage, the field goal percentage held at about the same value: 45.3% in the first year and 44.1% in the second. Thus, while there are some concerns about competing against Penn State, it appears that Woodson is still holding the team at a competitive level against them.
Purdue
The table for average difference between each coach and their respective seasons is pasted below:
To start, the analyst looked at the difference in the overall averages between coaches.
Overall aspects of the game against Purdue that were better for Archie Miller than for Mike Woodson include:
Overall aspects of the game against Purdue that are better for Mike Woodson than for Archie Miller include:
The analyst then looked at Coach Woodson’s averages and compared his first season against Purdue to his second. The analyst found that some components of the game improved, while others fell short of the season before. The most notable differences occur in the shooting percentages. Field goal percentage improved by about 7%, while three-point percentage saw a large improvement of 15.8%. With such large increases in shooting percentages, it does not come as a surprise that rebounds were down in the second year. It appears that Coach Woodson has improved the program against Purdue.
Rutgers
The table for average difference between each coach and their respective seasons is pasted below:
To start, the analyst looked at the difference in the overall averages between coaches.
Overall aspects of the game against Rutgers that were better for Archie Miller than for Mike Woodson include:
Overall aspects of the game against Rutgers that are better for Mike Woodson than for Archie Miller include:
The analyst then looked at Coach Woodson’s averages and compared his first season against Rutgers to his second. The analyst found that some components of the game improved, while others fell short of the season before. A notable difference is that both shooting percentages decreased in his second year. However, neither field goal percentage nor three point percentage saw decreases more than 5%. Other areas also saw some improvement like defensive rebounds and assists. All in all, there are some concerns about Coach Woodson’s performance against Rutgers.
Wisconsin
The table for average difference between each coach and their respective seasons is pasted below:
To start, the analyst looked at the difference in the overall averages between coaches.
Overall aspects of the game against Wisconsin that were better for Archie Miller than for Mike Woodson include:
Overall aspects of the game against Wisconsin that are better for Mike Woodson than for Archie Miller include:
The analyst then looked at Coach Woodson’s averages and compared his first season against Wisconsin to his second. The analyst found that some components of the game improved, while others fell short of the season before. One area worth noting is both shooting percentages. Field goal percentages increased by nearly 7% between the first and second season. However, three point percentage fell more than 20% between the two years. There were other areas that saw good improvement like defensive rebounds and turnovers. With all of this being taken into consideration, there are still some concerns about Coach Woodson’s performance against Wisconsin.
The analyst then looked to cluster the average team performance by opponent within the conference. This was done for each coach so that the cluster results between the two coaches could be compared. As demonstrated below, the analyst used k-means clustering. First, the process was done for Archie Miller’s two seasons as head coach. The data used to generate the clustering is printed so that each team can be identified within its cluster.
AM_conf_avg
## opponent OREB DREB AST STL TO FGP
## 1 Illinois 8.666667 26.33333 11.00000 5.333333 11.00000 40.35796
## 2 Iowa 13.333333 25.00000 15.33333 8.333333 11.33333 43.63191
## 3 Maryland 12.000000 27.33333 13.00000 2.333333 10.00000 41.99510
## 4 Michigan 7.000000 17.00000 7.50000 2.500000 9.00000 42.18159
## 5 Michigan State 10.000000 22.33333 11.66667 7.333333 9.00000 40.73365
## 6 Minnesota 9.000000 27.00000 14.66667 4.333333 12.33333 51.02323
## 7 Nebraska 14.333333 33.33333 16.33333 3.666667 13.00000 48.01078
## 8 Northwestern 13.000000 26.00000 13.00000 8.333333 15.00000 40.19928
## 9 Ohio State 7.666667 21.33333 11.00000 6.000000 13.00000 42.68481
## 10 Penn State 8.000000 27.00000 10.33333 6.000000 14.00000 44.35626
## 11 Purdue 9.250000 20.50000 12.00000 5.750000 12.00000 37.72054
## 12 Rutgers 10.250000 24.25000 12.25000 5.750000 12.50000 37.02235
## 13 Wisconsin 8.000000 24.66667 13.33333 3.666667 10.66667 41.62329
## ThreePTP
## 1 46.29630
## 2 41.84224
## 3 29.25749
## 4 25.83333
## 5 21.46199
## 6 35.63492
## 7 29.42308
## 8 31.21693
## 9 47.22222
## 10 28.49168
## 11 23.14312
## 12 32.49269
## 13 37.93651
Now, the k-means clustering model can be made using only the numeric columns of data. The analyst decided that three clusters were appropriate, in hopes of having a high-performance cluster, a mid-performance cluster, and a low-performance cluster. The code to do this can be seen in the window below:
set.seed(1234)
#fit the model
fit <- kmeans(AM_conf_avg[, 2:8], 3)#3 is the number of clusters we want to build
fit
## K-means clustering with 3 clusters of sizes 6, 3, 4
##
## Cluster means:
## OREB DREB AST STL TO FGP ThreePTP
## 1 11.097222 27.48611 13.26389 5.069444 12.80556 43.76783 31.08613
## 2 8.750000 19.94444 10.38889 5.194444 10.00000 40.21193 23.47948
## 3 9.416667 24.33333 12.66667 5.833333 11.50000 42.07449 43.32432
##
## Clustering vector:
## [1] 3 3 1 2 2 1 1 1 3 1 2 1 3
##
## Within cluster sum of squares by cluster:
## [1] 302.47030 70.36058 123.28281
## (between_SS / total_SS = 64.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
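To read the clustering vector above as a list of opponents, the cluster labels can be matched back to the printed rows. The following is a minimal sketch that re-types the opponent order and the clustering vector from the output above; in the original session the same result would presumably come from `split(AM_conf_avg$opponent, fit$cluster)`, which is an assumption about the author’s workspace.

```r
# Opponent order copied from the printed AM_conf_avg table.
opponents <- c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State",
               "Minnesota", "Nebraska", "Northwestern", "Ohio State",
               "Penn State", "Purdue", "Rutgers", "Wisconsin")

# Clustering vector copied from the kmeans() output for Archie Miller.
cluster_id <- c(3, 3, 1, 2, 2, 1, 1, 1, 3, 1, 2, 1, 3)

# Group opponent names by cluster label; the result is a named list.
members <- split(opponents, cluster_id)
print(members)
```

Note that the labels 1, 2, and 3 carry no ordering on their own; which cluster is "high" or "low" has to be read from the cluster means.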
Once the clusters had been made, it was time to analyze the results of the k-means model. The clusters appear to follow the general pattern the analyst was hoping for. Below are the results of the clusters found from Indiana team performance averages during Archie Miller’s last two years as head coach:
Next, the same was done for Mike Woodson’s first two years as head coach. The dataset being used is printed so that it is available for reference.
MW_conf_avg
## opponent OREB DREB AST STL TO FGP
## 1 Illinois 9.500000 27.50000 13.75000 5.000000 12.000000 46.97511
## 2 Iowa 10.000000 21.50000 17.75000 6.750000 14.250000 48.97580
## 3 Maryland 7.333333 28.00000 14.33333 4.333333 11.000000 46.80260
## 4 Michigan 7.500000 24.25000 14.00000 6.500000 10.250000 43.58259
## 5 Michigan State 8.666667 21.33333 12.33333 5.000000 11.000000 43.05701
## 6 Minnesota 7.666667 32.00000 16.66667 3.333333 8.666667 48.28042
## 7 Nebraska 8.000000 27.33333 14.00000 7.333333 14.666667 49.22807
## 8 Northwestern 7.333333 29.33333 14.00000 2.000000 12.000000 45.84628
## 9 Ohio State 13.666667 25.66667 14.33333 6.000000 9.666667 41.78620
## 10 Penn State 10.000000 22.33333 14.33333 5.000000 8.666667 44.90112
## 11 Purdue 5.750000 20.75000 11.25000 7.500000 6.500000 46.92774
## 12 Rutgers 9.000000 24.00000 11.00000 5.666667 12.333333 39.08730
## 13 Wisconsin 10.666667 29.33333 12.66667 3.000000 9.666667 42.15583
## ThreePTP
## 1 32.96620
## 2 32.41228
## 3 31.91142
## 4 36.57895
## 5 39.84127
## 6 37.89683
## 7 37.04429
## 8 34.09091
## 9 31.63743
## 10 43.00797
## 11 35.41667
## 12 31.41270
## 13 26.24644
Now, the analyst must generate the model, adapting the previous code for the new data set. Understand that the analyst still aims to create 3 clusters for this model.
set.seed(1234)
#fit the model
fit <- kmeans(MW_conf_avg[, 2:8], 3)#3 is the number of clusters we want to build
fit
## K-means clustering with 3 clusters of sizes 3, 4, 6
##
## Cluster means:
## OREB DREB AST STL TO FGP ThreePTP
## 1 11.111111 26.33333 12.66667 4.888889 10.555556 41.00978 29.76552
## 2 7.979167 22.16667 12.97917 6.000000 9.104167 44.61711 38.71121
## 3 8.305556 27.61111 15.08333 4.791667 12.097222 47.68471 34.38699
##
## Clustering vector:
## [1] 3 3 3 2 2 3 3 3 1 2 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 65.99112 83.55141 166.08657
## (between_SS / total_SS = 53.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
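Both models above are fit on the raw averages, so columns with a wider spread (such as ThreePTP) weigh more heavily in the distance calculation than columns like STL. As a point of comparison only, and not the method used in this study, the sketch below standardizes the columns with `scale()` and uses several random starts through the `nstart` argument of `kmeans()`. The `toy` data frame and its values are invented for illustration.

```r
set.seed(1234)

# Hypothetical per-opponent averages (made-up numbers, three columns only).
toy <- data.frame(
  STL      = c(2, 3, 5, 6, 8, 8, 4, 5, 7),
  FGP      = c(37, 38, 41, 42, 48, 51, 40, 43, 49),
  ThreePTP = c(21, 23, 30, 32, 46, 47, 28, 33, 42)
)

# Standardize each column to mean 0 and standard deviation 1.
scaled <- scale(toy)

# Fit 3 clusters, keeping the best of 25 random starts.
fit_scaled <- kmeans(scaled, centers = 3, nstart = 25)

# Report cluster means on the original (unscaled) units for readability.
aggregate(toy, by = list(cluster = fit_scaled$cluster), FUN = mean)
```

Whether scaling is appropriate depends on whether every statistic should count equally; that is a modeling choice rather than a correction.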
Once the clusters had been made, it was time to analyze the results of the k-means model. The clusters appear to follow the general pattern the analyst was hoping for. Below are the results of the clusters found from Indiana team performance averages during Mike Woodson’s first two years as head coach:
After observing each coach’s individual trends, it was then time to compare them against each other. First, it is important to notice that the field goal percentages in Woodson’s clusters are higher than those in Miller’s clusters. The maximum cluster-average FGP for Miller was 43.8%, while Woodson’s was 47.7%. However, Miller had the highest three-point cluster average at 43.3%, compared to Woodson’s highest at 38.7%. Miller also posts the lowest three-point cluster average at 23.5%, compared to Woodson’s 29.8%. Thus, the range of Woodson’s values appears to be narrower than that of Miller’s. This provides some evidence that, on average, Woodson’s teams play consistently better than Miller’s teams.
After looking at the difference in averages for the clusters, the teams were then compared within each cluster to see any similarities in opponents between the two coaches.
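One way to line up the two cluster assignments is a cross-tabulation of the clustering vectors printed above. This is a minimal sketch that re-types both vectors from the output; the variable names are introduced here for illustration. Because k-means labels are arbitrary, a matching label number across coaches does not by itself mean the same performance tier.

```r
opponents <- c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State",
               "Minnesota", "Nebraska", "Northwestern", "Ohio State",
               "Penn State", "Purdue", "Rutgers", "Wisconsin")

# Clustering vectors copied from the two kmeans() outputs.
miller_cluster  <- c(3, 3, 1, 2, 2, 1, 1, 1, 3, 1, 2, 1, 3)
woodson_cluster <- c(3, 3, 3, 2, 2, 3, 3, 3, 1, 2, 2, 1, 1)

# Rows are Miller's clusters, columns are Woodson's clusters.
overlap <- table(Miller = miller_cluster, Woodson = woodson_cluster)
print(overlap)

# Opponents that sit in cluster 2 under both coaches.
same_group <- opponents[miller_cluster == 2 & woodson_cluster == 2]
print(same_group)
```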
The analyst then looked to build a decision tree model to predict the average field goal percentage for the Hoosiers under Coach Woodson for the 2023-24 season. This model will be built using all of the data for the past two seasons of play and the prediction will come from the average values of the most recent season.
First, start by getting the data ready for the decision tree:
IU_BS_confMW <- IU_BS_confMW[,-(10:13)]
IU_BS_confMW <- IU_BS_confMW[,-4]
IU_BS_confMW <- IU_BS_confMW[,-1]
Now, build the decision tree model:
rpart2024 <- rpart(formula = FGP ~ ., data = IU_BS_confMW)
prp(rpart2024, digits =4, extra = 1)
Now, use the averages from the past season to forecast where the field goal percentage for Mike Woodson will be for his next season.
summary(IU_BS_confMW)
## OREB DREB AST STL
## Min. : 2.000 Min. :14.00 Min. : 6.00 Min. : 1.000
## 1st Qu.: 7.000 1st Qu.:22.50 1st Qu.:11.00 1st Qu.: 4.000
## Median : 9.000 Median :25.00 Median :14.00 Median : 5.000
## Mean : 8.791 Mean :25.44 Mean :13.91 Mean : 5.302
## 3rd Qu.:10.500 3rd Qu.:29.00 3rd Qu.:16.00 3rd Qu.: 7.000
## Max. :15.000 Max. :35.00 Max. :22.00 Max. :11.000
## BLK TO PF FGP
## Min. : 1.000 Min. : 3.00 Min. :10.0 Min. :30.36
## 1st Qu.: 3.000 1st Qu.: 9.00 1st Qu.:15.0 1st Qu.:40.98
## Median : 4.000 Median :10.00 Median :18.0 Median :45.16
## Mean : 4.442 Mean :10.81 Mean :17.3 Mean :45.33
## 3rd Qu.: 6.000 3rd Qu.:13.00 3rd Qu.:20.0 3rd Qu.:49.44
## Max. :10.000 Max. :23.00 Max. :25.0 Max. :61.82
## ThreePTP FTP
## Min. :12.50 Min. : 46.15
## 1st Qu.:26.79 1st Qu.: 66.67
## Median :31.58 Median : 71.43
## Mean :34.62 Mean : 71.98
## 3rd Qu.:40.83 3rd Qu.: 78.17
## Max. :76.92 Max. :100.00
Based on the decision tree and the current trends of Mike Woodson, the field goal percentage next season should be around the 47% mark for the Hoosiers. Understand that there are limitations to this prediction, as the model was not evaluated for accuracy (since there are only 40 rows of data). However, it is still a useful and realistic benchmark for Woodson to aim for within the next season.
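For reference, one way a forecast like this could be read off a fitted tree programmatically is with `predict()`, passing a one-row data frame of the latest season’s averages. The sketch below is an assumption about how that might look, not the procedure used above: the `toy` data, its `season` column, and the chosen predictors are all invented for illustration.

```r
library(rpart)

set.seed(42)
n <- 40

# Hypothetical game-level rows; values are simulated, not Indiana data.
toy <- data.frame(
  season = rep(c("2022", "2023"), each = n / 2),
  OREB   = rpois(n, 9),
  DREB   = rpois(n, 25),
  AST    = rpois(n, 14),
  TO     = rpois(n, 11)
)
toy$FGP <- 30 + 1.0 * toy$AST - 0.3 * toy$TO + rnorm(n, sd = 3)

# Regression tree for field goal percentage.
tree <- rpart(FGP ~ OREB + DREB + AST + TO, data = toy)

# One-row data frame holding the most recent season's predictor averages.
latest  <- toy[toy$season == "2023", c("OREB", "DREB", "AST", "TO")]
newdata <- as.data.frame(t(colMeans(latest)))

# The prediction is the mean FGP of the leaf that these averages fall into.
pred <- predict(tree, newdata = newdata)
print(pred)
```

Because a regression tree predicts a leaf mean, the forecast always lies within the range of FGP values seen in the training data.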
Player Development
After gathering general team information, the analyst set out to compare individual player development between the two different coaches within the two year time span of data at hand. The analyst decided that the following areas would be looked into further to see if player development was occurring for each of the coaches:
Manipulating the Data for Analysis
Filter the data to only display individual information.
player_BS <- filter(IU_BoxScores, player != "TEAM")
After filtering to remove rows consisting of team information, the data was then filtered to only include conference opponent games. Since there was consistency in these opponents across both coaches, this acts as a control group.
player_BS_conf <- filter(player_BS, opponent %in% c("Illinois", "Iowa", "Maryland", "Michigan", "Michigan State", "Minnesota", "Nebraska", "Northwestern", "Ohio State", "Penn State", "Purdue", "Rutgers", "Wisconsin"))
Players Averaging Over 15 Minutes Per Game
After the data was filtered to only include conference games, the analyst decided to only look at players who averaged 15 minutes of play or more during each coach’s two-year period. This narrows the focus group and removes players with very limited playing time from the data. This was done by grouping the data by each player and finding the average number of minutes played.
#creating a dataset for conference play for Archie Miller
players_confAM <- filter(player_BS_conf, season == "2020" | season == "2021")
#grouping by players
playersAM <- group_by(players_confAM, player)
#find average minutes overall
players_confAM_MIN <- summarise(playersAM, type = mean(MIN, na.rm = TRUE))
arrange(players_confAM_MIN, desc(type))
## # A tibble: 17 × 2
## player type
## <chr> <dbl>
## 1 Trayce Jackson-Davis 32.5
## 2 Justin Smith 31
## 3 Al Durham 29.8
## 4 Rob Phinisee 26.9
## 5 Race Thompson 21.8
## 6 Devonte Green 20.6
## 7 Joey Brunk 19.0
## 8 Trey Galloway 18.8
## 9 Armaan Franklin 18.7
## 10 Jerome Hunter 18.0
## 11 Anthony Leal 11.3
## 12 Damezi Anderson 9.88
## 13 De'Ron Davis 9.45
## 14 Khristian Lander 9.42
## 15 Jordan Geronimo 7.76
## 16 Cooper Bybee 0.5
## 17 Nathan Childress 0.5
For conference games, the following players were found to average over the 15 minute mark:
After discovering this, the data was then filtered one more time so that only these players would be included in the analysis.
playerAM_conf15 <- filter(players_confAM, player == "Trayce Jackson-Davis" | player == "Justin Smith" | player == "Al Durham" | player == "Rob Phinisee" | player == "Race Thompson" | player == "Devonte Green" | player == "Joey Brunk" | player == "Trey Galloway" | player == "Armaan Franklin" | player == "Jerome Hunter")
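The player names in the filter above are typed in by hand from the printed table. As an alternative sketch, the same selection could be derived from the 15-minute threshold directly, so the list updates if the data changes. The `box` data frame, its values, and the names `avg_min`, `keep`, and `box15` are hypothetical and use base R only.

```r
# Toy box-score rows; players and minutes are made up for illustration.
box <- data.frame(
  player = c("A", "A", "B", "B", "C", "C"),
  MIN    = c(30, 34, 10, 12, 15, 15),
  PTS    = c(12, 20, 2, 4, 6, 8)
)

# Average minutes per player.
avg_min <- aggregate(MIN ~ player, data = box, FUN = mean)

# Players averaging 15 minutes or more.
keep <- as.character(avg_min$player[avg_min$MIN >= 15])

# Keep only the game rows for those players.
box15 <- box[box$player %in% keep, ]
print(box15)
```

Using `>= 15` matches the "15 minutes of play or more" wording; a strict `> 15` would drop a player who averages exactly 15.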
Then, the same was done for the two seasons that Woodson was head coach.
#creating a dataset for conference play for Mike Woodson
players_confMW <- filter(player_BS_conf, season == "2022" | season == "2023")
#grouping by players
playersMW <- group_by(players_confMW, player)
#find average minutes overall
players_confMW_MIN <- summarise(playersMW, type = mean(MIN, na.rm = TRUE))
arrange(players_confMW_MIN, desc(type))
## # A tibble: 21 × 2
## player type
## <chr> <dbl>
## 1 Jalen Hood-Schifino 34.7
## 2 Trayce Jackson-Davis 34.6
## 3 Xavier Johnson 29.4
## 4 Miller Kopp 28.5
## 5 Trey Galloway 27.1
## 6 Race Thompson 26.9
## 7 Parker Stewart 24.6
## 8 Rob Phinisee 18.7
## 9 Tamar Bates 15.7
## 10 Malik Reneau 14.8
## # … with 11 more rows
For conference games, the following players were found to average over the 15 minute mark:
After discovering this, the data was then filtered one more time so that only these players would be included in the analysis.
playerMW_conf15 <- filter(players_confMW, player == "Jalen Hood-Schifino" | player == "Trayce Jackson-Davis" | player == "Xavier Johnson" | player == "Miller Kopp" | player == "Trey Galloway" | player == "Race Thompson" | player == "Parker Stewart" | player == "Rob Phinisee" | player == "Tamar Bates")
FGP
Since some games may have 0 posted attempts for field goals, the analyst has to look at each shooting percentage by itself to ensure that no undefined values are derived. First, start by creating a data set with just season, player, position, FGM and FGA for Archie Miller.
columns <- c(1:3, 5:6)
playerAM_conf15_FG <- playerAM_conf15[,columns]
playerAM_conf15_FG <- filter(playerAM_conf15_FG, FGA != 0)
Now, create the new variable FGP with the following code:
playerAM_conf15_FG$FGP <- (playerAM_conf15_FG$FGM / playerAM_conf15_FG$FGA) * 100
Now, do the same for Mike Woodson. Starting by filtering the data accordingly.
columns <- c(1:3, 5:6)
playerMW_conf15_FG <- playerMW_conf15[,columns]
playerMW_conf15_FG <- filter(playerMW_conf15_FG, FGA != 0)
Then, generate the new variable:
playerMW_conf15_FG$FGP <- (playerMW_conf15_FG$FGM / playerMW_conf15_FG$FGA) * 100
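Two details of this calculation are worth illustrating: why games with zero attempts are filtered out, and how an average of per-game percentages differs from a pooled makes-over-attempts percentage. The following is a small worked sketch with invented numbers; the averages reported in this section use the per-game approach.

```r
# Three hypothetical games for one player; the last has no attempts.
games <- data.frame(
  FGM = c(1, 6, 0),
  FGA = c(1, 12, 0)
)

# Dividing 0 makes by 0 attempts is undefined (NaN in R).
undefined <- 0 / 0
print(is.nan(undefined))

# Per-game approach: drop zero-attempt games, then average the percentages.
shot <- games[games$FGA != 0, ]
shot$FGP <- shot$FGM / shot$FGA * 100
mean_of_games <- mean(shot$FGP)

# Pooled approach: total makes over total attempts.
pooled <- sum(games$FGM) / sum(games$FGA) * 100

print(c(mean_of_games = mean_of_games, pooled = pooled))
```

The per-game average gives a 1-for-1 game the same weight as a 6-for-12 game, so it can sit well above or below the pooled figure when attempt counts vary.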
Now, aggregate the data in a couple different ways. First find the mean of each position by coach. The table below shows information for Coach Miller.
aggregate(cbind(FGP) ~ position, data = playerAM_conf15_FG, FUN = mean, na.rm = TRUE)
## position FGP
## 1 F 44.17147
## 2 G 33.33301
Then, the same was done for Coach Woodson.
aggregate(cbind(FGP) ~ position, data = playerMW_conf15_FG, FUN = mean, na.rm = TRUE)
## position FGP
## 1 F 49.93120
## 2 G 35.54146
As shown above, Woodson’s average field goal percentage is higher than Miller’s at both positions. Now, the analyst looked to take this a step further by looking at position by season.
aggregate(cbind(FGP) ~ season + position, data = playerAM_conf15_FG, FUN = mean, na.rm = TRUE)
## season position FGP
## 1 2020 F 43.06275
## 2 2021 F 45.94923
## 3 2020 G 32.56220
## 4 2021 G 34.15736
Both positions for Miller saw an increase over the two-year span. Now, see if the same holds true for Woodson.
aggregate(cbind(FGP) ~ season + position, data = playerMW_conf15_FG, FUN = mean, na.rm = TRUE)
## season position FGP
## 1 2022 F 49.29014
## 2 2023 F 50.73846
## 3 2022 G 34.59260
## 4 2023 G 37.08536
Similarly, both positions for Woodson saw growth over his two years as coach. It is clear to see that with Woodson as coach the average FGP has been higher regardless of position and season. Next, the analyst wanted to look into individual player growth. First, look at Archie Miller’s player development.
aggregate(cbind(FGP) ~ season + player, data = playerAM_conf15_FG, FUN = mean, na.rm = TRUE)
## season player FGP
## 1 2020 Al Durham 39.32738
## 2 2021 Al Durham 34.61451
## 3 2020 Armaan Franklin 26.06481
## 4 2021 Armaan Franklin 38.98416
## 5 2020 Devonte Green 29.66818
## 6 2020 Jerome Hunter 29.92997
## 7 2021 Jerome Hunter 43.59127
## 8 2020 Joey Brunk 45.62500
## 9 2020 Justin Smith 46.86273
## 10 2020 Race Thompson 39.47917
## 11 2021 Race Thompson 44.65602
## 12 2020 Rob Phinisee 34.64271
## 13 2021 Rob Phinisee 29.69719
## 14 2020 Trayce Jackson-Davis 50.73025
## 15 2021 Trayce Jackson-Davis 49.36461
## 16 2021 Trey Galloway 34.60784
Of the 10 players listed, four only had data available for one season. Thus, six players could be compared from one season to the next. In doing this, the analyst discovered that there were three players (of the six) who actually saw a decrease in their in-conference FGP between the 2019-20 season and the 2020-21 season. Looking at the chart they were found to be: Al Durham, Rob Phinisee, and Trayce Jackson-Davis. However, understand and proceed with caution as some players experienced an injury in the latter season.
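The season-to-season comparison can also be computed rather than read off the table. This is a minimal sketch that re-types a subset of the values printed above (the six two-season players plus one single-season player) and takes the difference between seasons; the names `fgp`, `by_season`, `change`, and `decliners` are introduced here for illustration.

```r
# Values copied from the aggregate() output above.
fgp <- data.frame(
  player = c("Al Durham", "Al Durham", "Armaan Franklin", "Armaan Franklin",
             "Devonte Green", "Jerome Hunter", "Jerome Hunter",
             "Race Thompson", "Race Thompson", "Rob Phinisee", "Rob Phinisee",
             "Trayce Jackson-Davis", "Trayce Jackson-Davis"),
  season = c("2020", "2021", "2020", "2021", "2020", "2020", "2021",
             "2020", "2021", "2020", "2021", "2020", "2021"),
  FGP    = c(39.32738, 34.61451, 26.06481, 38.98416, 29.66818,
             29.92997, 43.59127, 39.47917, 44.65602, 34.64271, 29.69719,
             50.73025, 49.36461)
)

# Player-by-season matrix; missing seasons become NA.
by_season <- tapply(fgp$FGP, list(fgp$player, fgp$season), mean)

# Change from the first season to the second (NA for one-season players).
change <- by_season[, "2021"] - by_season[, "2020"]

# Players with data in both seasons whose FGP went down.
decliners <- names(change)[!is.na(change) & change < 0]
print(round(change, 2))
print(decliners)
```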
Now, the same was evaluated for Coach Woodson.
aggregate(cbind(FGP) ~ season + player, data = playerMW_conf15_FG, FUN = mean, na.rm = TRUE)
## season player FGP
## 1 2023 Jalen Hood-Schifino 40.54527
## 2 2022 Miller Kopp 36.17637
## 3 2023 Miller Kopp 50.71537
## 4 2022 Parker Stewart 33.09163
## 5 2022 Race Thompson 53.47147
## 6 2023 Race Thompson 47.94218
## 7 2022 Rob Phinisee 26.81490
## 8 2022 Tamar Bates 27.93891
## 9 2023 Tamar Bates 25.73517
## 10 2022 Trayce Jackson-Davis 57.65243
## 11 2023 Trayce Jackson-Davis 52.71895
## 12 2022 Trey Galloway 48.72222
## 13 2023 Trey Galloway 46.30357
## 14 2022 Xavier Johnson 38.46749
## 15 2023 Xavier Johnson 21.59091
Of the 9 players listed, three only had data available for one season. Thus, six players could be compared from one season to the next. In doing this, the analyst discovered that there were five players (of the six) who actually saw a decrease in their in-conference FGP between the 2021-22 season and the 2022-23 season. Looking at the chart they were found to be: Race Thompson, Tamar Bates, Trayce Jackson-Davis, Trey Galloway, and Xavier Johnson. However, understand and proceed with caution as some players experienced an injury in the latter season. While these findings may not be directly due to coaching, it is one aspect to consider when evaluating the coach.
ThreePTP
Since some games may have 0 posted attempts for three pointers, the analyst has to look at each shooting percentage by itself to ensure that no undefined values are derived. First start by creating a data set with just season, player, position, ThreePTM and ThreePTA for Archie Miller.
columns <- c(1:3, 7:8)
playerAM_conf15_3P <- playerAM_conf15[,columns]
playerAM_conf15_3P <- filter(playerAM_conf15_3P, ThreePTA != 0)
Now, create the new variable ThreePTP with the following code:
playerAM_conf15_3P$ThreePTP <- (playerAM_conf15_3P$ThreePTM / playerAM_conf15_3P$ThreePTA) * 100
Now, do the same for Mike Woodson. Starting by filtering the data accordingly.
columns <- c(1:3, 7:8)
playerMW_conf15_3P <- playerMW_conf15[,columns]
playerMW_conf15_3P <- filter(playerMW_conf15_3P, ThreePTA != 0)
Then, generate the new variable:
playerMW_conf15_3P$ThreePTP <- (playerMW_conf15_3P$ThreePTM / playerMW_conf15_3P$ThreePTA) * 100
Now, aggregate the data in a couple different ways. First find the mean of each position by coach. The table below shows information for Coach Miller.
aggregate(cbind(ThreePTP) ~ position, data = playerAM_conf15_3P, FUN = mean, na.rm = TRUE)
## position ThreePTP
## 1 F 27.03125
## 2 G 29.04295
Then, the same was done for Coach Woodson.
aggregate(cbind(ThreePTP) ~ position, data = playerMW_conf15_3P, FUN = mean, na.rm = TRUE)
## position ThreePTP
## 1 F 35.75964
## 2 G 29.84594
As shown above, Woodson’s average three-point percentage is higher than Miller’s at both positions. In fact, forwards were much more effective under Woodson, with a difference of nearly 9 percentage points in three-point percentage (35.8% versus 27.0%). Now, the analyst looked to take this a step further by looking at position by season.
aggregate(cbind(ThreePTP) ~ season + position, data = playerAM_conf15_3P, FUN = mean, na.rm = TRUE)
## season position ThreePTP
## 1 2020 F 27.47748
## 2 2021 F 26.41975
## 3 2020 G 29.75953
## 4 2021 G 28.25036
Both positions for Miller saw a decrease over the two year span. Now, see if the same holds true for Woodson.
aggregate(cbind(ThreePTP) ~ season + position, data = playerMW_conf15_3P, FUN = mean, na.rm = TRUE)
## season position ThreePTP
## 1 2022 F 34.30786
## 2 2023 F 38.07172
## 3 2022 G 28.06368
## 4 2023 G 32.63702
Conversely, both positions for Woodson saw growth over his two years as coach. It is clear to see that Woodson has helped the team be better three point shooters in his first two seasons. Next, the analyst wanted to look into individual player growth. First, look at Archie Miller’s player development.
aggregate(cbind(ThreePTP) ~ season + player, data = playerAM_conf15_3P, FUN = mean, na.rm = TRUE)
## season player ThreePTP
## 1 2020 Al Durham 37.10526
## 2 2021 Al Durham 40.07143
## 3 2020 Armaan Franklin 19.33333
## 4 2021 Armaan Franklin 36.90476
## 5 2020 Devonte Green 32.03896
## 6 2020 Jerome Hunter 30.20833
## 7 2021 Jerome Hunter 33.13725
## 8 2020 Justin Smith 22.22222
## 9 2020 Race Thompson 33.33333
## 10 2021 Race Thompson 15.00000
## 11 2020 Rob Phinisee 28.24561
## 12 2021 Rob Phinisee 18.23308
## 13 2021 Trey Galloway 15.38462
Of the 8 players listed, three only had data available for one season. Thus, five players could be compared from one season to the next. In doing this the analyst discovered that there were two players (of the five) who actually saw a decrease in their in-conference ThreePTP between the 2019-20 season and the 2020-21 season. Looking at the chart, they were found to be: Race Thompson and Rob Phinisee. However, understand and proceed with caution as some players experienced an injury in the latter season.
Now, the same was evaluated for Coach Woodson.
aggregate(cbind(ThreePTP) ~ season + player, data = playerMW_conf15_3P, FUN = mean, na.rm = TRUE)
## season player ThreePTP
## 1 2023 Jalen Hood-Schifino 27.48599
## 2 2022 Miller Kopp 35.99567
## 3 2023 Miller Kopp 43.89683
## 4 2022 Parker Stewart 31.73521
## 5 2022 Race Thompson 35.96491
## 6 2023 Race Thompson 21.42857
## 7 2022 Rob Phinisee 17.97052
## 8 2022 Tamar Bates 27.85714
## 9 2023 Tamar Bates 28.80208
## 10 2022 Trayce Jackson-Davis 0.00000
## 11 2022 Trey Galloway 25.00000
## 12 2023 Trey Galloway 43.14815
## 13 2022 Xavier Johnson 32.34848
## 14 2023 Xavier Johnson 12.50000
Of the 9 players listed, four only had data available for one season. Thus, five players could be compared from one season to the next. In doing this the analyst discovered that there were two players (of the five) who actually saw a decrease in their in-conference ThreePTP between the 2021-22 season and the 2022-23 season. Looking at the chart, they were found to be: Race Thompson and Xavier Johnson. However, understand and proceed with caution as some players experienced an injury in the latter season. While these findings may not be directly due to coaching, it is one aspect to consider when evaluating the coach.
PTS
Since points do not have the ability to be undefined, the analyst just needed to find the averages in a variety of ways. First find the mean of each position by coach. The table below shows information for Coach Miller.
aggregate(cbind(PTS) ~ position, data = playerAM_conf15, FUN = mean, na.rm = TRUE)
## position PTS
## 1 F 8.889610
## 2 G 7.585526
Then, the same was done for Coach Woodson.
aggregate(cbind(PTS) ~ position, data = playerMW_conf15, FUN = mean, na.rm = TRUE)
## position PTS
## 1 F 11.854839
## 2 G 7.576923
As shown above, Woodson’s forwards averaged more points than Miller’s. However, the guards under both coaches appear to contribute about the same number of points per conference game. Now, the analyst looked to take this a step further by looking at position by season.
aggregate(cbind(PTS) ~ season + position, data = playerAM_conf15, FUN = mean, na.rm = TRUE)
## season position PTS
## 1 2020 F 7.291667
## 2 2021 F 11.534483
## 3 2020 G 7.037975
## 4 2021 G 8.178082
Both positions for Miller saw an increase over the two year span. Now, see if the same holds true for Woodson.
aggregate(cbind(PTS) ~ season + position, data = playerMW_conf15, FUN = mean, na.rm = TRUE)
## season position PTS
## 1 2022 F 11.36232
## 2 2023 F 12.47273
## 3 2022 G 6.84375
## 4 2023 G 8.75000
Similarly, both positions for Woodson saw growth over his two years as coach. Next, the analyst wanted to look into individual player growth. First, look at Archie Miller’s player development.
aggregate(cbind(PTS) ~ season + player, data = playerAM_conf15, FUN = mean, na.rm = TRUE)
## season player PTS
## 1 2020 Al Durham 9.200000
## 2 2021 Al Durham 11.700000
## 3 2020 Armaan Franklin 2.050000
## 4 2021 Armaan Franklin 11.133333
## 5 2020 Devonte Green 9.550000
## 6 2020 Jerome Hunter 3.736842
## 7 2021 Jerome Hunter 7.000000
## 8 2020 Joey Brunk 6.100000
## 9 2020 Justin Smith 9.500000
## 10 2020 Race Thompson 3.588235
## 11 2021 Race Thompson 8.700000
## 12 2020 Rob Phinisee 7.368421
## 13 2021 Rob Phinisee 7.050000
## 14 2020 Trayce Jackson-Davis 12.800000
## 15 2021 Trayce Jackson-Davis 18.450000
## 16 2021 Trey Galloway 3.055556
Of the 10 players listed, four only had data available for one season. Thus, six players could be compared from one season to the next. In doing this, the analyst discovered that there was only one player (of the six) who actually saw a decrease in in-conference PTS between the 2019-20 season and the 2020-21 season. Looking at the chart, it was found to be Rob Phinisee. However, understand and proceed with caution as some players experienced an injury in the latter season.
Now, the same was evaluated for Coach Woodson.
aggregate(cbind(PTS) ~ season + player, data = playerMW_conf15, FUN = mean, na.rm = TRUE)
## season player PTS
## 1 2023 Jalen Hood-Schifino 14.444444
## 2 2022 Miller Kopp 5.304348
## 3 2023 Miller Kopp 7.700000
## 4 2022 Parker Stewart 5.863636
## 5 2022 Race Thompson 11.826087
## 6 2023 Race Thompson 6.933333
## 7 2022 Rob Phinisee 5.000000
## 8 2022 Tamar Bates 3.000000
## 9 2023 Tamar Bates 4.800000
## 10 2022 Trayce Jackson-Davis 16.956522
## 11 2023 Trayce Jackson-Davis 21.400000
## 12 2022 Trey Galloway 6.400000
## 13 2023 Trey Galloway 7.850000
## 14 2022 Xavier Johnson 13.136364
## 15 2023 Xavier Johnson 6.000000
Of the 9 players listed, three had data available for only one season. Thus, six players could be compared from one season to the next. Of those six, two players saw a decrease in in-conference PTS between the 2021-22 season and the 2022-23 season: Race Thompson and Xavier Johnson. However, interpret this with caution, as some players experienced an injury in the latter season. While these findings may not be directly due to coaching, they are one aspect to consider when evaluating the coach.
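The injury caveat is easier to weigh when each average is shown next to the number of conference games behind it. The sketch below is only an illustration under stated assumptions: the per-game rows are toy data with hypothetical players ("Player A", "Player B") and made-up point totals, and the names `games`, `means`, `counts`, and `summary_tbl` are not from the original analysis. In practice the same two `aggregate()` calls could be pointed at `playerAM_conf15` or `playerMW_conf15`.

```r
# Toy per-game rows; players and point values are hypothetical.
games <- data.frame(
  season = c(2022, 2022, 2022, 2023, 2023, 2022, 2022, 2023),
  player = c("Player A", "Player A", "Player A", "Player A", "Player A",
             "Player B", "Player B", "Player B"),
  PTS    = c(10, 14, 12, 6, 8, 4, 6, 9)
)

# Mean points and number of games for each player-season.
means  <- aggregate(PTS ~ season + player, data = games, FUN = mean)
counts <- aggregate(PTS ~ season + player, data = games, FUN = length)
names(means)[3]  <- "mean_PTS"
names(counts)[3] <- "games"

# One row per player-season, with the average and its sample size side by side.
summary_tbl <- merge(means, counts, by = c("season", "player"))
print(summary_tbl)
```

A season average built on only a handful of games is a weaker basis for a development claim than one built on a full conference schedule.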
AST
Since assists cannot be undefined, the analyst only needed to compute the averages in several ways. First, find the mean of each position by coach. The table below shows information for Coach Miller.
aggregate(cbind(AST) ~ position, data = playerAM_conf15, FUN = mean, na.rm = TRUE)
## position AST
## 1 F 0.8766234
## 2 G 2.1052632
Then, the same was done for Coach Woodson.
aggregate(cbind(AST) ~ position, data = playerMW_conf15, FUN = mean, na.rm = TRUE)
## position AST
## 1 F 1.774194
## 2 G 2.134615
As shown above, Woodson’s forwards had a higher average in assists than Miller’s. However, the guards between both coaches appear to contribute about the same number of assists per conference game. Now, the analyst looked to take this a step further by looking at position by season.
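To make the coach-to-coach comparison explicit, the two position tables can be joined and differenced. This is a minimal sketch that re-enters the averages printed above; the names `miller`, `woodson`, and `both` are illustrative assumptions rather than objects from the original analysis.

```r
# Position averages copied from the two aggregate() outputs above.
miller  <- data.frame(position = c("F", "G"), AST = c(0.8766234, 2.1052632))
woodson <- data.frame(position = c("F", "G"), AST = c(1.774194, 2.134615))

# Join on position so each coach's average sits in its own column.
both <- merge(miller, woodson, by = "position",
              suffixes = c("_Miller", "_Woodson"))

# Positive diff means Woodson's average is higher than Miller's.
both$diff <- both$AST_Woodson - both$AST_Miller
print(both)
```

The difference column shows a gap of roughly 0.9 assists per game for forwards and roughly 0.03 for guards, which matches the reading of the tables above.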
aggregate(cbind(AST) ~ season + position, data = playerAM_conf15, FUN = mean, na.rm = TRUE)
## season position AST
## 1 2020 F 0.7083333
## 2 2021 F 1.1551724
## 3 2020 G 1.9746835
## 4 2021 G 2.2465753
Both positions for Miller saw an increase in average AST over the two-year span. Now, see if the same holds true for Woodson.
aggregate(cbind(AST) ~ season + position, data = playerMW_conf15, FUN = mean, na.rm = TRUE)
## season position AST
## 1 2022 F 1.376812
## 2 2023 F 2.272727
## 3 2022 G 2.177083
## 4 2023 G 2.066667
Similarly, forwards for Woodson saw growth over his two years as coach. However, the guards saw a slight decrease in the number of assists during conference games. Next, the analyst wanted to look into individual player growth. First, look at Archie Miller’s player development.
aggregate(cbind(AST) ~ season + player, data = playerAM_conf15, FUN = mean, na.rm = TRUE)
## season player AST
## 1 2020 Al Durham 2.1500000
## 2 2021 Al Durham 2.4000000
## 3 2020 Armaan Franklin 0.7000000
## 4 2021 Armaan Franklin 1.8666667
## 5 2020 Devonte Green 1.8000000
## 6 2020 Jerome Hunter 0.4210526
## 7 2021 Jerome Hunter 0.6111111
## 8 2020 Joey Brunk 0.4000000
## 9 2020 Justin Smith 1.0000000
## 10 2020 Race Thompson 0.2941176
## 11 2021 Race Thompson 1.3000000
## 12 2020 Rob Phinisee 3.3157895
## 13 2021 Rob Phinisee 3.0500000
## 14 2020 Trayce Jackson-Davis 1.3500000
## 15 2021 Trayce Jackson-Davis 1.5000000
## 16 2021 Trey Galloway 1.5000000
Of the 10 players listed, four had data available for only one season. Thus, six players could be compared from one season to the next. Of those six, only one player saw a decrease in in-conference AST between the 2019-20 season and the 2020-21 season: Rob Phinisee. However, interpret this with caution, as some players experienced an injury in the latter season.
Now, the same was evaluated for Coach Woodson.
aggregate(cbind(AST) ~ season + player, data = playerMW_conf15, FUN = mean, na.rm = TRUE)
## season player AST
## 1 2023 Jalen Hood-Schifino 3.4444444
## 2 2022 Miller Kopp 1.0434783
## 3 2023 Miller Kopp 1.1000000
## 4 2022 Parker Stewart 1.1363636
## 5 2022 Race Thompson 1.2173913
## 6 2023 Race Thompson 0.8000000
## 7 2022 Rob Phinisee 1.6250000
## 8 2022 Tamar Bates 0.4761905
## 9 2023 Tamar Bates 0.9000000
## 10 2022 Trayce Jackson-Davis 1.8695652
## 11 2023 Trayce Jackson-Davis 4.5500000
## 12 2022 Trey Galloway 2.1333333
## 13 2023 Trey Galloway 1.8000000
## 14 2022 Xavier Johnson 5.2727273
## 15 2023 Xavier Johnson 4.0000000
Of the 9 players listed, three had data available for only one season. Thus, six players could be compared from one season to the next. Of those six, three players saw a decrease in in-conference AST between the 2021-22 season and the 2022-23 season: Race Thompson, Trey Galloway, and Xavier Johnson. However, interpret this with caution, as some players experienced an injury in the latter season. While these findings may not be directly due to coaching, they are one aspect to consider when evaluating the coach.
REB
Since rebounds cannot be undefined, the analyst only needed to compute the averages in several ways. First, find the mean of each position by coach. The table below shows information for Coach Miller.
aggregate(cbind(REB) ~ position, data = playerAM_conf15, FUN = mean, na.rm = TRUE)
## position REB
## 1 F 5.318182
## 2 G 2.342105
Then, the same was done for Coach Woodson.
aggregate(cbind(REB) ~ position, data = playerMW_conf15, FUN = mean, na.rm = TRUE)
## position REB
## 1 F 6.241935
## 2 G 2.358974
As shown above, Woodson’s forwards had a higher average in rebounds than Miller’s. However, the guards between both coaches appear to contribute about the same number of rebounds per conference game. Now, the analyst looked to take this a step further by looking at position by season.
aggregate(cbind(REB) ~ season + position, data = playerAM_conf15, FUN = mean, na.rm = TRUE)
## season position REB
## 1 2020 F 4.802083
## 2 2021 F 6.172414
## 3 2020 G 2.126582
## 4 2021 G 2.575342
Both positions for Miller saw an increase in average REB over the two-year span. Now, see if the same holds true for Woodson.
aggregate(cbind(REB) ~ season + position, data = playerMW_conf15, FUN = mean, na.rm = TRUE)
## season position REB
## 1 2022 F 6.000000
## 2 2023 F 6.545455
## 3 2022 G 2.281250
## 4 2023 G 2.483333
Similarly for Woodson, both positions also saw growth in the number of rebounds in conference games over his two years as coach. Next, the analyst wanted to look into individual player growth. First, look at Archie Miller’s player development.
aggregate(cbind(REB) ~ season + player, data = playerAM_conf15, FUN = mean, na.rm = TRUE)
## season player REB
## 1 2020 Al Durham 2.000000
## 2 2021 Al Durham 2.600000
## 3 2020 Armaan Franklin 1.000000
## 4 2021 Armaan Franklin 3.666667
## 5 2020 Devonte Green 2.950000
## 6 2020 Jerome Hunter 2.052632
## 7 2021 Jerome Hunter 3.111111
## 8 2020 Joey Brunk 4.800000
## 9 2020 Justin Smith 5.000000
## 10 2020 Race Thompson 4.058824
## 11 2021 Race Thompson 5.950000
## 12 2020 Rob Phinisee 2.578947
## 13 2021 Rob Phinisee 2.500000
## 14 2020 Trayce Jackson-Davis 7.850000
## 15 2021 Trayce Jackson-Davis 9.150000
## 16 2021 Trey Galloway 1.722222
Of the 10 players listed, four had data available for only one season. Thus, six players could be compared from one season to the next. Of those six, only one player saw a decrease in in-conference REB between the 2019-20 season and the 2020-21 season: Rob Phinisee. However, interpret this with caution, as some players experienced an injury in the latter season.
Now, the same was evaluated for Coach Woodson.
aggregate(cbind(REB) ~ season + player, data = playerMW_conf15, FUN = mean, na.rm = TRUE)
## season player REB
## 1 2023 Jalen Hood-Schifino 3.777778
## 2 2022 Miller Kopp 2.391304
## 3 2023 Miller Kopp 2.800000
## 4 2022 Parker Stewart 2.181818
## 5 2022 Race Thompson 7.739130
## 6 2023 Race Thompson 4.000000
## 7 2022 Rob Phinisee 2.062500
## 8 2022 Tamar Bates 1.095238
## 9 2023 Tamar Bates 1.450000
## 10 2022 Trayce Jackson-Davis 7.869565
## 11 2023 Trayce Jackson-Davis 12.200000
## 12 2022 Trey Galloway 1.733333
## 13 2023 Trey Galloway 2.500000
## 14 2022 Xavier Johnson 4.045455
## 15 2023 Xavier Johnson 1.000000
Of the 9 players listed, three had data available for only one season. Thus, six players could be compared from one season to the next. Of those six, two players saw a decrease in in-conference REB between the 2021-22 season and the 2022-23 season: Race Thompson and Xavier Johnson. However, interpret this with caution, as some players experienced an injury in the latter season. While these findings may not be directly due to coaching, they are one aspect to consider when evaluating the coach.
It is clear that in most of the categories explored, Woodson’s averages outdid Miller’s. Even more important, from one year to the next, the analyst saw improvement for Woodson and his team in most of the same categories. Considering the team’s past four seasons, especially in conference play, improvement is necessary for the Hoosiers to return to overall success. In terms of individual player development, conclusions are harder to come by. Under both coaches, players typically improved in each category from one season to the next. However, because players also gain experience and face the risk of injury, it is hard to say that the coach is the underlying factor in that improvement.
In the past 4 years both coaches recruited similarly. In fact, both coaches:
The only difference in the coaches’ recruitment came from the average height of guards, which increased an inch (from 6’6” to 6’7”) when Mike Woodson replaced Archie Miller.
Overall Win/Loss Insights
Conference Win/Loss Insights
In-Conference Team Performance Insights
For individual player development, conclusions were harder to come by. Under both coaches, players typically improved in each category from one season to the next. However, because players also gain experience and face the risk of injury, it is hard to say that the coach is the underlying factor in that improvement.
While the analysis done is thorough, there are shortcomings and limitations to the study conducted. A few have been identified and included below:
For example, more could have been learned if additional data had been provided. Some ideas for this will be included in the Next Steps tab, but one example is including information on all of the years that Archie Miller was head coach. This would give a more complete picture of his entire coaching career as opposed to the team’s performance in his last two seasons. In fact, it may show that his coaching career at IU was more successful than the data set used here suggests.
In addition, it is important to notice that most of the analysis was done using descriptive analytics. While this was useful for identifying what had happened for the Hoosiers in the past 4 seasons, it does not allow the analyst to project what could happen in the future. Also, because the study relies primarily on descriptive analytics, its findings cannot be generalized to a different team or to a different set of coaches. In other words, the code could be reused, but the analysis would have to be redone.
Lastly, this project had certain time constraints that it needed to follow. Due to these constraints, as with any study, the study is not as thorough as it could have been. If more time had been available, the data could have been examined in different ways, and the inclusion of the game logs could have elevated the study to the next level.
Regardless of these limitations, this project still appears to lay a good foundation for generating a general recruiting profile and evaluating a team’s performance under a given head coach.
To ensure continued growth and/or success for the program, continued evaluation of the team’s performance under the current acting head coach should be done regularly. One way that this can be done is to monitor the performance rates over the course of each season and report the findings annually. As the team gets better, it is unlikely to show as much improvement as was present in this study; thus, continued research could include, but is not limited to:
A larger focus on the team’s strengths or weaknesses as opposed to comparison of averages by year. Identification of these areas can help the coaches play into their team’s strengths, providing coaches with data-driven strategies and game plans. In addition, identification of weaknesses can help the coach focus on this area of play when preparing for the next season so it is not such a detriment to the team.
Inclusion of the game logs to derive useful in-game trends. This could allow the coaches to see which strategies were effective in the game and which strategies did not provide evidence of working. This could also provide better insights on individual performances in specific moments of the game. Note that this analysis could also be done for the team’s opponents so that valuable scouting information would also be available to improve overall team performance.
Using NCAA DI Men’s basketball data to find benchmarks that denote a successful team. This could be done by aggregating season averages for teams that made the NCAA tournament in the past season(s) and then comparing these results to those of the team’s most recent or current season. This method is beneficial because it provides context for how the Hoosiers did in comparison to other (successful) teams as opposed to a comparison with last year’s team. It also gives the coaches an expectation of exactly how much improvement needs to take place to take their team to the next level.