Points: 50

Overview

An important element of data-driven decision making is the ability to communicate your data and interpretations visually. As “big data” and data analysis grow in both importance and complexity, understanding the basic principles of data visualization matters even more.

Purpose

This assignment aligns with the second course objective: create visualizations using R.

Dataset and Submission Instructions

The dataset UMDBasketball2022.csv contains season-by-season records of the Maryland Terrapins men's basketball program from 1923 to 2022. The data were originally scraped from https://www.sports-reference.com/cbb/schools/maryland/men/. In this assignment, we will use this dataset to study the team’s overall wins and coaches’ performance. A data dictionary can be found at the end of this document.
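Several columns in the data dictionary have spaces in their names (for example "Opponents PTS" and "NCAA Tournament"), so how the file is read affects the names you use in code. The sketch below illustrates this with base R's read.csv on a tiny stand-in file. The file contents are made up for illustration and are not taken from UMDBasketball2022.csv; only the header names come from the data dictionary.

```r
# Stand-in file: two made-up rows using header names from the data dictionary.
# The values are illustrative only and are NOT taken from UMDBasketball2022.csv.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Year,OverallWins,Opponents PTS,NCAA Tournament",
  "2001,20,70.5,Some result",
  "2002,25,68.0,"
), tmp)

# Default behaviour: check.names = TRUE replaces spaces with dots.
with_dots <- read.csv(tmp)
names(with_dots)

# check.names = FALSE keeps the headers as written; na.strings = "" reads
# empty cells as NA, which is what an is.na() test on the column relies on.
with_spaces <- read.csv(tmp, check.names = FALSE, na.strings = "")
names(with_spaces)
is.na(with_spaces[["NCAA Tournament"]])
```

With the second form, columns can be referenced as umd_data[["Opponents PTS"]] or with backticks inside dplyr verbs.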

Visualization Guidelines

Adjust the axis titles, chart titles, colors, sizes, fonts, and legend locations as needed. Categories should be in a meaningful order, if appropriate. Also, format the grid lines and data series appropriately to enhance the clarity of the graph. Remember to write an informative title with some insights (except Q4 d).

You must turn in a well-formatted HTML output file. To do so, please click on the Knit button at the top with the wool ball icon. Make sure to include both the code chunks and their corresponding output charts in the generated output file.

Q1. Explore the distribution of overall wins. (8 points)

  1. Create a boxplot that examines the distribution of overall wins. (2 points)

  2. Add points of overall wins using the geom_jitter function. (2 points)

  3. Add a title to describe your main finding. (2 points)

  4. Improve your chart to make it clear and ready for presenting to your readers. (2 points)

(You only need to present a single chart with all the required information mentioned above)

library(ggplot2)

# Single boxplot of overall wins with each season jittered on top
ggplot(umd_data, aes(x = factor(1), y = OverallWins)) +
  geom_boxplot(outlier.shape = NA) +  # jitter already draws every point, so hide duplicate outliers
  geom_jitter(aes(color = Conf), width = 0.2) +
  labs(title = "Distribution of Overall Wins for UMD Basketball",
       x = "",
       y = "Overall Wins",
       color = "Conference") +  # legend title (previously blanked out by the theme)
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "bottom")
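A numeric summary can help when writing a title that states the main finding of the boxplot. The sketch below uses base R on a short vector of made-up win totals (not values from the dataset) to show the quantities a boxplot draws. Note that ggplot2 documents its hinges as the 25th and 75th percentiles, which can differ slightly from the Tukey hinges used here.

```r
# Illustrative win totals; NOT values from UMDBasketball2022.csv.
wins <- c(8, 12, 15, 15, 17, 19, 21, 22, 25, 32)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
fivenum(wins)

# boxplot.stats() applies the 1.5 * IQR rule to separate whiskers from outliers.
stats <- boxplot.stats(wins)
stats$stats  # whisker ends, hinges, and median
stats$out    # values beyond the whiskers, if any
```

Replacing wins with umd_data$OverallWins would give the corresponding figures for the real data.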


Q2. Explore the correlations between numeric variables. (8 points)

  1. Create a correlations heat map for the following variables: OverallWins, ConferenceWins, SRS, SOS, PTS, and Opponents PTS. (3 points)

  2. Improve your chart to make it clear and ready for presenting to your readers. (3 points)

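library(corrplot)

# "Opponents PTS" contains a space, which assumes the CSV headers were kept as written
selected_data <- umd_data[, c("OverallWins", "ConferenceWins", "SRS", "SOS", "PTS", "Opponents PTS")]

# Pairwise correlations, using only seasons with no missing values
cor_matrix <- cor(selected_data, use = "complete.obs")

# Upper-triangle heat map with coefficients printed in each cell
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.7,
         title = "Correlation Heatmap of Basketball Performance",
         mar = c(0, 0, 1, 0))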

  3. Which variables are positively correlated with overall wins? Which variable is most correlated with overall wins? (2 points)

    Answer here:
    Based on the correlation heatmap, the following variables are positively correlated with overall wins:

    • Conference wins, with a correlation of 0.76.
    • SRS (Simple Rating System), with a correlation of 0.8.
    • SOS (Strength of Schedule), with a correlation of 0.45.
    • PTS (Points Per Game), with a correlation of 0.48.

    The variable that is most correlated with Overall wins is SRS, with a correlation coefficient of 0.8, indicating a strong positive relationship. This suggests that the Simple Rating System, which takes into account point differential and strength of schedule, is a good predictor of overall winning performance.
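The ranking read off the heat map can also be checked programmatically. The following is a minimal sketch on synthetic data: the column names mirror the assignment, but the values are randomly generated, so the numbers it prints are unrelated to the correlations reported above.

```r
set.seed(1)
# Synthetic stand-in data; values are random and NOT from the UMD dataset.
n <- 50
srs <- rnorm(n)
toy <- data.frame(
  OverallWins = 18 + 4 * srs + rnorm(n, sd = 1),
  SRS = srs,
  PTS = 70 + 2 * srs + rnorm(n, sd = 5)
)

cor_matrix <- cor(toy, use = "complete.obs")

# Correlations of every other variable with OverallWins, strongest first.
with_wins <- cor_matrix["OverallWins", colnames(cor_matrix) != "OverallWins"]
ranked <- sort(with_wins, decreasing = TRUE)
round(ranked, 2)
names(ranked)[1]  # variable most positively correlated with OverallWins
```

If strong negative correlations should also count as "most correlated", sorting abs(with_wins) instead would rank by strength regardless of sign.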


Q3. Explore the relationship between overall wins and conference wins. (12 points)

  1. Create a scatter plot of the overall wins and conference wins; use different colors or shapes to denote different conferences (ACC, Big Ten, and Southern). (3 points)

  2. Add a single trend line to the chart. (3 points)

  3. Improve your chart to make it clear and ready for presenting to your readers. (3 points)

# Scatter plot of overall vs. conference wins, colored and shaped by conference
ggplot(umd_data, aes(x = ConferenceWins, y = OverallWins)) +
  geom_point(aes(color = Conf, shape = Conf)) +
  # One linear trend line fitted to all seasons (not one per conference)
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
  scale_color_manual(values = c("ACC" = "purple4", "Big Ten" = "red4", "Southern" = "pink4")) +
  labs(title = "Relationship Between Overall Wins and Conference Wins by Conference",
       x = "Conference Wins",
       y = "Overall Wins",
       color = "Conference",
       shape = "Conference") +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5))

  4. What pattern do you notice? (3 points)

    Answer here: The scatter plot shows a positive relationship between overall wins and conference wins; seasons with more conference wins tend to have more overall wins. This trend is consistent across the three conferences represented: ACC, Big Ten, and Southern.

    While the data points for each conference are distinct, they generally follow the same upward trajectory, implying that the relationship between conference wins and overall wins does not significantly differ by conference. There is some spread in the data points around the trend line, indicating variability in the relationship between conference and overall wins within each conference.
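The observation that the relationship does not differ much by conference is a visual judgment. One way to check it, sketched here on synthetic data, is to fit the slope separately for each conference and compare a single-line model against one with conference-specific lines. The data below are generated with one shared slope by construction and are not the UMD data, so the output only illustrates the mechanics.

```r
set.seed(42)
# Synthetic stand-in data with one shared slope; NOT the UMD dataset.
toy <- data.frame(
  Conf = rep(c("ACC", "Big Ten", "Southern"), each = 20),
  ConferenceWins = sample(0:16, 60, replace = TRUE)
)
toy$OverallWins <- 8 + 1.2 * toy$ConferenceWins + rnorm(60, sd = 2)

# Slope of OverallWins on ConferenceWins, fitted separately per conference.
slopes <- sapply(split(toy, toy$Conf), function(d) {
  coef(lm(OverallWins ~ ConferenceWins, data = d))[["ConferenceWins"]]
})
round(slopes, 2)

# Compare a single trend line with conference-specific intercepts and slopes.
pooled <- lm(OverallWins ~ ConferenceWins, data = toy)
by_conf <- lm(OverallWins ~ ConferenceWins * Conf, data = toy)
anova(pooled, by_conf)
```

Similar slopes and a large p-value in the anova table would be consistent with a single trend line describing all conferences.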


Q4. Explore the change in overall wins over the years. (12 points)

  1. Create a line chart for the time series of overall wins. (3 points)

  2. Add a vertical line at x = 2010. (2 points)

  3. Improve your chart to make it clear and ready for presenting to your readers. (3 points)

ts <- ggplot(umd_data, aes(x = Year, y = OverallWins)) +
  geom_line() +  # Add line geometry
  geom_vline(xintercept = 2010, linetype = "dashed", color = "red4") +  # Add vertical line at x = 2010
  labs(title = "Time Series of Overall Wins",
       x = "Year",
       y = "Overall Wins") +
  theme_minimal() +  # Use a minimal theme for clarity
  theme(plot.title = element_text(hjust = 0.5),  # Center the plot title
        axis.text.x = element_text(angle = 45, hjust = 1))  # Angle the year labels
ts

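  4. Create an animated version of the time series to show how overall wins change over the years. The title of your chart should dynamically display the current year. (4 points)

library(ggplot2)
library(gganimate)

# Reveal the line year by year; frame_along holds the current position along Year
line_animated <- ts +
  # as.integer() keeps the title from showing fractional years between frames
  labs(title = "Year: {as.integer(frame_along)}", subtitle = "Overall Wins") +
  transition_reveal(Year) +
  ease_aes("linear")  # the first argument of ease_aes() is the default easing

animate(line_animated, fps = 8)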


Q5. Explore the number of seasons in which each coach makes it to the NCAA tournament and the number of seasons in which he/she does not. (10 points)

  1. Create a stacked bar chart to show the number of seasons in which each coach makes it to the NCAA tournament and the number of seasons in which they do not. (3 points)

  2. Order the coaches based on their first year of serving as the coach at UMD. (2 points)

  3. Improve your chart to make it clear and ready for presenting to your readers. (3 points)

library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)

# Replace dots in column names with spaces (e.g. "NCAA.Tournament" -> "NCAA Tournament")
names(umd_data) <- gsub("\\.", " ", names(umd_data))

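# Count NCAA and non-NCAA seasons per coach.
# Assumes seasons without a tournament appearance are stored as NA in `NCAA Tournament`.
umd_coach_ncaa <- umd_data %>%
  mutate(NCAA_Appearance = !is.na(`NCAA Tournament`)) %>%
  group_by(Coach) %>%
  summarise(Seasons_Made_NCAA = sum(NCAA_Appearance),
            Seasons_Not_Made_NCAA = n() - sum(NCAA_Appearance),
            First_Year = min(Year)) %>%  # first season, used to order the coaches
  ungroup() %>%
  arrange(First_Year)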

umd_coach_ncaa_long <- umd_coach_ncaa %>%
  select(Coach, Seasons_Made_NCAA, Seasons_Not_Made_NCAA, First_Year) %>%
  pivot_longer(cols = c("Seasons_Made_NCAA", "Seasons_Not_Made_NCAA"),
               names_to = "NCAA_Status", values_to = "Seasons")

ggplot(umd_coach_ncaa_long, aes(x = reorder(Coach, First_Year), y = Seasons, fill = NCAA_Status)) +
  geom_col() +  # bar heights come directly from Seasons; segments stack by NCAA_Status
  scale_fill_manual(values = c("Seasons_Made_NCAA" = "blue4", "Seasons_Not_Made_NCAA" = "red4"),
                    labels = c("Seasons_Made_NCAA" = "Made NCAA", "Seasons_Not_Made_NCAA" = "Did Not Make NCAA")) +
  labs(title = "UMD Coaches' NCAA Tournament Seasons",
       x = "Coach (Ordered by First Year)",
       y = "Number of Seasons",
       fill = "NCAA Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        legend.title = element_text(face = "bold"),
        legend.position = "bottom")

best_coach <- umd_coach_ncaa %>%
  filter(Seasons_Made_NCAA == max(Seasons_Made_NCAA)) %>%
  pull(Coach)

print(paste("The best coach in terms of the number of seasons made to the NCAA is:", best_coach))
## [1] "The best coach in terms of the number of seasons made to the NCAA is: Gary Williams"

  4. Which coach is the best in terms of the number of seasons that he/she makes it to the NCAA? (2 points)

    Answer here: Gary Williams is the best coach by this measure, with the most seasons reaching the NCAA tournament.
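The per-coach counts can be cross-checked without dplyr. The sketch below does this in base R on made-up seasons for two hypothetical coaches ("Coach A" and "Coach B" are placeholders, not names from the dataset). It also treats empty strings as non-appearances, which is an assumption about how missing tournament results might be stored.

```r
# Made-up seasons for two hypothetical coaches; NOT the UMD data.
toy <- data.frame(
  Year = 2001:2006,
  Coach = c("Coach A", "Coach A", "Coach A", "Coach B", "Coach B", "Coach B"),
  NCAA = c("R64", NA, "R32", NA, "", "R64"),
  stringsAsFactors = FALSE
)

# Count a season as an appearance only if the result is neither NA nor "".
made <- !is.na(toy$NCAA) & toy$NCAA != ""

# Per-coach totals; every tapply() call uses the same grouping, so rows line up.
made_n <- tapply(made, toy$Coach, sum)
counts <- data.frame(
  Coach = names(made_n),
  Made = as.vector(made_n),
  NotMade = as.vector(tapply(!made, toy$Coach, sum)),
  FirstYear = as.vector(tapply(toy$Year, toy$Coach, min))
)
counts <- counts[order(counts$FirstYear), ]
counts

# Coach(es) with the most appearances; ties return more than one name.
best <- counts$Coach[counts$Made == max(counts$Made)]
best
```

Because ties return several names, the same check on the real data would also reveal whether more than one coach shares the top count.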

Data appendix

Column/Field               Definition
Season                     NCAA Tournament appearance
Year                       Season year
Conf                       Conference
OverallWins                Wins
OverallLosses              Losses
OverallWinLossPercent      Win-Loss percentage
ConferenceWins             Conference Wins
ConferenceLosses           Conference Losses
ConferenceWinLossPercent   Conference Win-Loss percentage
SRS                        Simple Rating System: A rating that takes into account average point differential and strength of schedule. The rating is denominated in points above/below average, where zero is average.
SOS                        Strength of Schedule: A rating of strength of schedule. The rating is denominated in points above/below average, where zero is average.
PTS                        Points Per Game
Opponents PTS              Opponent Points Per Game
AP Pre                     Rank in pre-season AP poll
AP High                    Highest rank in AP poll during the season
AP Final                   Rank in final AP poll
NCAA Tournament            NCAA Tournament result
Seed                       Seed number
Coach                      Coach and their win-loss record during the season