Points: 50

Overview

An important element of data-driven decisions is the ability to visually communicate your data and interpretations. As “big data” and data analysis become more important yet more complex, it is even more important to understand the basic principles of data visualization.

Purpose

This assignment aligns with the second course objective: create visualizations using R.

Dataset and Submission Instructions

The dataset UMDBasketball2022.csv contains season-by-season records of the Maryland Terrapins men’s basketball team from 1923 to 2022. The data was originally scraped from https://www.sports-reference.com/cbb/schools/maryland/men/. In this assignment, we will use this dataset to study the team’s overall wins and coaches’ performance. A data dictionary can be found at the end of this document.

Visualization Guidelines

Make sure to change the axis titles, chart titles, colors, size, font, legend locations etc. if needed. Categories should be in a meaningful order, if appropriate. Also, format the grid lines and data series appropriately to enhance the clarity of the graph. Remember to write an informative title with some insights (except Q4 d).
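
As an illustration of these guidelines, the sketch below puts categories in a meaningful order and adjusts the titles, legend position, and grid lines. The data frame `toy` and its values are made up for illustration; they are not taken from UMDBasketball2022.csv.

```r
library(ggplot2)

# Hypothetical toy data (not from the assignment dataset)
toy <- data.frame(
  category = c("Low", "High", "Medium"),
  value    = c(3, 9, 6)
)

# Put the categories in a meaningful order instead of alphabetical order
toy$category <- factor(toy$category, levels = c("Low", "Medium", "High"))

p <- ggplot(toy, aes(x = category, y = value, fill = category)) +
  geom_col() +
  labs(title = "Value rises steadily from Low to High",  # Title states the insight
       x = "Category", y = "Value", fill = "Category") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.position = "bottom",
        panel.grid.minor = element_blank(),    # Drop minor grid lines
        panel.grid.major.x = element_blank())  # Keep only horizontal guides

p
```

Setting the factor levels explicitly is one option; `reorder()` is another when the order should follow a numeric column.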

You must turn in a well-formatted HTML output file. To do so, please click on the Knit button at the top with the wool ball icon. Make sure to include both the code chunks and their corresponding output charts in the generated output file.

Q1. Explore the distribution of overall wins. (8 points)

  1. Create a boxplot that examines the distribution of overall wins. (2 points)

  2. Add points of overall wins using the geom_jitter function. (2 points)

  3. Add a title to describe your main finding. (2 points)

  4. Improve your chart to make it clear and ready for presenting to your readers. (2 points)

(You only need to present a single chart with all the required information mentioned above)

library(ggplot2)

# Assuming UMDBasketball2022.csv is in your working directory
data <- read.csv("UMDBasketball2022.csv")

ggplot(data, aes(x = factor(1), y = OverallWins)) +
  geom_boxplot(outlier.shape = NA) +  # Hide boxplot outliers; geom_jitter already draws every point
  geom_jitter(width = 0.2, alpha = 0.6) +
  labs(title = "Distribution of Overall Wins for Maryland Terrapins",
       x = "",
       y = "Overall Wins") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_blank())  # Drop the placeholder "1" tick label
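
To support a chart title with numbers, the quartiles and outlier fences behind a boxplot can be computed directly. This is a minimal sketch in which the made-up vector `wins` stands in for `data$OverallWins`; the values are hypothetical.

```r
# Hypothetical win totals standing in for data$OverallWins
wins <- c(8, 12, 15, 17, 19, 21, 24, 36)

# Quartiles that define the box and the median line
q <- quantile(wins, probs = c(0.25, 0.5, 0.75))

# Points farther than 1.5 * IQR from the box are drawn as outliers
iqr <- IQR(wins)
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- wins[wins < lower_fence | wins > upper_fence]

print(q)
print(outliers)
```

Replacing `wins` with the real column gives the median and spread to cite in the title.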


Q2. Explore the correlations between numeric variables. (8 points)

  1. Create a correlations heat map for the following variables: OverallWins, ConferenceWins, SRS, SOS, PTS, and Opponents PTS. (3 points)

  2. Improve your chart to make it clear and ready for presenting to your readers. (3 points)

# Install and load the corrplot package if not already installed
if (!requireNamespace("corrplot", quietly = TRUE)) install.packages("corrplot")
library(corrplot)

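# Load the data with readr::read_csv(), which keeps column names such as "Opponents PTS" unchanged
# (assumes UMDBasketball2022.csv is in your working directory)
umd_data <- readr::read_csv("UMDBasketball2022.csv")

selected_data <- umd_data[, c("OverallWins", "ConferenceWins", "SRS", "SOS", "PTS", "Opponents PTS")]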

cor_matrix <- cor(selected_data, use = "complete.obs")

corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, # text label color and rotation
         addCoef.col = "black", # color for the correlation coefficients
         number.cex = 0.7, # size of the correlation coefficients
         title = "Correlation Heatmap of Basketball Performance Metrics",
         mar = c(0,0,1,0)) # margins for the plot

  3. Which variables are positively correlated with overall wins? Which variable is most correlated with overall wins?  (2 points)

    Answer here: The variable most correlated with OverallWins is ConferenceWins, with a correlation coefficient of 0.76. This strong positive correlation suggests that seasons with more conference wins tend to have more overall wins.
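
One way to read off which variables are positively correlated with OverallWins is to pull that row out of the correlation matrix and sort it. The sketch below uses a made-up data frame `toy` with hypothetical values; the real coefficients come from `cor_matrix`.

```r
# Hypothetical values standing in for the selected columns
toy <- data.frame(
  OverallWins    = c(10, 14, 18, 22, 25),
  ConferenceWins = c(4, 6, 9, 10, 13),
  SOS            = c(6, 2, 5, 1, 4)
)

cor_toy <- cor(toy, use = "complete.obs")

# Correlations with OverallWins, excluding its correlation with itself
wins_cor <- cor_toy["OverallWins", colnames(cor_toy) != "OverallWins"]

# Sort from strongest positive to strongest negative
wins_cor <- sort(wins_cor, decreasing = TRUE)
print(round(wins_cor, 2))

# Names of the positively correlated variables and of the strongest one
positive_vars <- names(wins_cor)[wins_cor > 0]
strongest_var <- names(wins_cor)[which.max(abs(wins_cor))]
```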


Q3. Explore the relationship between overall wins and conference wins. (12 points)

  1. Create a scatter plot of the overall wins and conference wins; use different colors or shapes to denote different conferences (ACC, Big Ten and Southern). (3 points)

  2. Add a single trend line to the chart. (3 points)

  3. Improve your chart to make it clear and ready for presenting to your readers. (3 points)

library(ggplot2)

ggplot(data, aes(x = OverallWins, y = ConferenceWins)) +
  geom_point(aes(color = Conf, shape = Conf)) +  # Plot the individual data points, styled by conference
  geom_smooth(method = "lm", se = FALSE) +  # Add a single trend line for all data
  scale_color_manual(values = c("ACC" = "red", "Big Ten" = "blue", "Southern" = "green")) +
  scale_shape_manual(values = c("ACC" = 17, "Big Ten" = 15, "Southern" = 18)) +
  labs(
    title = "Relationship Between Overall Wins and Conference Wins",
    x = "Overall Wins",
    y = "Conference Wins",
    color = "Conference",
    shape = "Conference"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14),  # Center and resize the plot title
    legend.position = "bottom",  # Position the legend at the bottom
    plot.margin = margin(t = 30, r = 30, b = 30, l = 30)  # Add margin space around the plot
  )

  4. What pattern do you notice? (3 points)

    Answer here:

    1. The ACC (red triangles) and Big Ten (blue squares) points are more dispersed, suggesting wider variance in the number of conference wins among seasons with similar overall wins.

    2. The Southern Conference (green diamonds) appears to have fewer data points, suggesting either fewer seasons in the dataset for that conference or a narrower range of overall wins.

    3. The trend line (in blue) shows a positive relationship between overall wins and conference wins: as one goes up, so does the other. The line appears to fit the ACC and Big Ten points better than the Southern Conference points, suggesting a stronger correlation for the former two.
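
The impression that the trend line fits some conferences better than others can be checked by computing the correlation within each conference. This is a sketch on made-up data: `toy` and its values are hypothetical, and on the real data the same steps would use the `OverallWins`, `ConferenceWins`, and `Conf` columns of `data`.

```r
# Hypothetical seasons, three per conference
toy <- data.frame(
  Conf           = rep(c("ACC", "Big Ten", "Southern"), each = 3),
  OverallWins    = c(15, 20, 25, 16, 21, 27, 10, 14, 18),
  ConferenceWins = c(6, 9, 12, 8, 11, 14, 7, 5, 6)
)

# Split the rows by conference and compute one correlation per group
by_conf <- split(toy, toy$Conf)
conf_cor <- sapply(by_conf, function(d) cor(d$OverallWins, d$ConferenceWins))

print(round(conf_cor, 2))
```

A conference whose coefficient is much closer to zero than the others is one where the single trend line describes the seasons less well.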


Q4. Explore the change of overall wins over years. (12 points)

  1. Create a line chart for the time series of overall wins. (3 points)

  2. Add a vertical line at x = 2010. (2 points)

  3. Improve your chart to make it clear and ready for presenting to your readers. (3 points)

timeseries <- ggplot(data, aes(x = Year, y = OverallWins)) +
  geom_line() + # Add line geometry
  geom_vline(xintercept = 2010, linetype = "dashed", color = "red4") + # Add vertical line at x = 2010
  labs(title = "Time Series of Overall Wins",
       x = "Year",
       y = "Overall Wins") +
  theme_minimal() + # Use a minimal theme for clarity
  theme(plot.title = element_text(hjust = 0.5), # Center the plot title
        axis.text.x = element_text(angle = 45, hjust = 1))
timeseries

  4. Create an animated version of the time series to show how overall wins change over the years. The title of your chart should dynamically display the current year. (4 points)

library(ggplot2)
library(gganimate)

# Assuming timeseries is already created with your first block of code

# Modify the ggplot object for animation
line_animated <- timeseries +
  labs(title = 'Year: {round(frame_along)}', subtitle = 'Overall Wins') +  # Show the current year as a whole number
  transition_reveal(Year) +  # Reveal the line along the Year column
  ease_aes('linear')  # Move through the years at a constant pace

# Generate the animation
animate(line_animated, fps = 8, width = 800, height = 600, renderer = gifski_renderer())


Q5. Explore the number of seasons that each coach makes it to the NCAA tournament and the number of seasons he/she does not. (10 points)

  1. Create a stacked bar chart to show the number of seasons that each coach makes it to the NCAA tournament and the number of seasons they do not. (3 points)

  2. Order the coaches based on their first year of serving as the coach at UMD. (2 points)

  3. Improve your chart to make it clear and ready for presenting to your readers. (3 points)

library(ggplot2)
library(dplyr)
library(tidyr)
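library(readr) # for read_csv

# Load the data; read_csv() keeps column names as written and reads blank cells as NA
# (assumes UMDBasketball2022.csv is in your working directory)
umd_data <- read_csv("UMDBasketball2022.csv")

# Fix the NCAA Tournament column name if it contains periods instead of spaces
names(umd_data) <- gsub("\\.", " ", names(umd_data))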

# Calculate NCAA appearances and non-appearances
umd_coach_ncaa <- umd_data %>%
  mutate(NCAA_Appearance = !is.na(`NCAA Tournament`)) %>%
  group_by(Coach) %>%
  summarise(Seasons_Made_NCAA = sum(NCAA_Appearance),
            Seasons_Not_Made_NCAA = n() - sum(NCAA_Appearance),
            First_Year = min(Year)) %>%
  ungroup() %>%
  arrange(First_Year)

# Reshape data for plotting
umd_coach_ncaa_long <- umd_coach_ncaa %>%
  select(Coach, Seasons_Made_NCAA, Seasons_Not_Made_NCAA, First_Year) %>%
  pivot_longer(cols = c("Seasons_Made_NCAA", "Seasons_Not_Made_NCAA"),
               names_to = "NCAA_Status", values_to = "Seasons")

# Create the stacked bar chart
ggplot(umd_coach_ncaa_long, aes(x = reorder(Coach, First_Year), y = Seasons, fill = NCAA_Status)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Seasons_Made_NCAA" = "blue4", "Seasons_Not_Made_NCAA" = "red4"),
                    labels = c("Seasons_Made_NCAA" = "Made NCAA", "Seasons_Not_Made_NCAA" = "Did Not Make NCAA")) +
  labs(title = "UMD Coaches' NCAA Tournament Seasons",
       x = "Coach (Ordered by First Year)",
       y = "Number of Seasons",
       fill = "NCAA Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        legend.title = element_text(face = "bold"),
        legend.position = "bottom")

# Determine the best coach in terms of the number of seasons made it to the NCAA
best_coach <- umd_coach_ncaa %>%
  filter(Seasons_Made_NCAA == max(Seasons_Made_NCAA)) %>%
  pull(Coach)

# Print the best coach
print(paste("The best coach in terms of the number of seasons made to the NCAA is:", best_coach))
## [1] "The best coach in terms of the number of seasons made to the NCAA is: Gary Williams"

  4. Which coach is the best in terms of the number of seasons that he/she makes it to the NCAA? (2 points)

    Answer here:

    Gary Williams
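
The appearance count relies on blank `NCAA Tournament` cells being `NA`. `readr::read_csv()` treats blanks as missing by default, whereas base `read.csv()` keeps blank text cells as empty strings unless `na.strings` says otherwise, in which case `!is.na()` would count every season as an appearance. The sketch below shows the difference; the inline CSV and coach names are made up for illustration.

```r
# Hypothetical three-season extract; the second season has no tournament result
csv_text <- "Coach,NCAA Tournament
Coach A,Lost Second Round
Coach A,
Coach B,Won National Final"

# Default: the blank cell stays an empty string, so nothing is NA
default_read <- read.csv(text = csv_text, check.names = FALSE)
default_count <- sum(!is.na(default_read[["NCAA Tournament"]]))

# Treat blanks as missing so that !is.na() marks real appearances only
na_read <- read.csv(text = csv_text, check.names = FALSE,
                    na.strings = c("", "NA"))
appearances <- !is.na(na_read[["NCAA Tournament"]])

# Appearances per coach
per_coach <- tapply(appearances, na_read$Coach, sum)
print(default_count)
print(per_coach)
```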


Data appendix

Column/Field              Definition
Season                    NCAA Tournament appearance
Year                      Season year
Conf                      Conference
OverallWins               Wins
OverallLosses             Losses
OverallWinLossPercent     Win-Loss percentage
ConferenceWins            Conference Wins
ConferenceLosses          Conference Losses
ConferenceWinLossPercent  Conference Win-Loss percentage
SRS                       Simple Rating System: A rating that takes into account average point differential and strength of schedule. The rating is denominated in points above/below average, where zero is average.
SOS                       Strength of Schedule: A rating of strength of schedule. The rating is denominated in points above/below average, where zero is average.
PTS                       Points Per Game
Opponents PTS             Opponent Points Per Game
AP Pre                    Rank in pre-season AP poll
AP High                   Highest rank in AP poll during the season
AP Final                  Rank in final AP poll
NCAA Tournament           NCAA Tournament result
Seed                      Seed number
Coach                     Coach and their win-loss record during the season