An important element of data-driven decision-making is the ability to communicate your data and interpretations visually. As “big data” and data analysis grow in both importance and complexity, understanding the basic principles of data visualization matters even more.
This assignment aligns with the second course objective: create visualizations using R.
The dataset UMDBasketball2022.csv contains season-by-season records of Maryland Terrapins men's basketball from 1923 to 2022. The data was originally scraped from https://www.sports-reference.com/cbb/schools/maryland/men/. In this assignment, we will use this dataset to study the team’s overall wins and coaches’ performance. A data dictionary can be found at the end of this document.
Make sure to change the axis titles, chart titles, colors, size, font, legend locations etc. if needed. Categories should be in a meaningful order, if appropriate. Also, format the grid lines and data series appropriately to enhance the clarity of the graph. Remember to write an informative title with some insights (except Q4 d).
You must turn in a well-formatted HTML output file. To do so, please click on the Knit button at the top with the wool ball icon. Make sure to include both the code chunks and their corresponding output charts in the generated output file.
Create a boxplot that examines the distribution of overall wins. (2 points)
Add points of overall wins using the geom_jitter function. (2 points)
Add a title to describe your main finding. (2 points)
Improve your chart to make it clear and ready for presenting to your readers. (2 points)
(You only need to present a single chart with all the required information mentioned above)
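library(ggplot2)
# Assuming UMDBasketball2022.csv is in your working directory.
# check.names = FALSE keeps column names such as "Opponents PTS" exactly as written in the file;
# na.strings = c("", "NA") reads blank cells as NA.
data <- read.csv("UMDBasketball2022.csv", check.names = FALSE, na.strings = c("", "NA"))
umd_data <- data # later chunks refer to the same data frame as umd_data
ggplot(data, aes(x = factor(1), y = OverallWins)) +
  geom_boxplot(outlier.shape = NA) + # hide boxplot outliers; the jittered points already show every season
  geom_jitter(color = "steelblue", width = 0.2, height = 0) + # jitter horizontally only so win counts stay exact
  labs(title = "Distribution of Overall Wins for Maryland Terrapins",
       x = "",
       y = "Overall Wins") +
  theme_minimal() +
  theme(axis.text.x = element_blank(), # the dummy x value carries no information
        plot.title = element_text(hjust = 0.5))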
Create a correlations heat map for the following variables: OverallWins, ConferenceWins, SRS, SOS, PTS, and Opponents PTS. (3 points)
Improve your chart to make it clear and ready for presenting to your readers. (3 points)
# Install and load the corrplot package if not already installed
if (!requireNamespace("corrplot", quietly = TRUE)) install.packages("corrplot")
library(corrplot)
selected_data <- umd_data[,c("OverallWins", "ConferenceWins", "SRS", "SOS", "PTS", "Opponents PTS")]
cor_matrix <- cor(selected_data, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, # text label color and rotation
         addCoef.col = "black", # color for the correlation coefficients
         number.cex = 0.7, # size of the correlation coefficients
         title = "Correlation Heatmap of Basketball Performance Metrics",
         mar = c(0, 0, 1, 0)) # margins for the plot
Which variables are positively correlated with overall wins? Which variable is most correlated with overall wins? (2 points)
Answer here: The variable most correlated with OverallWins is ConferenceWins, with a correlation coefficient of 0.76. This strong positive correlation suggests that seasons with more conference wins also tend to have more overall wins.
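To answer both parts of the question directly from the correlation matrix, one option is to pull out the OverallWins column, drop the self-correlation, and sort what remains. The sketch below is only an illustration: the `toy` data frame and its values are made up (they are not the UMD data), and the names `wins_cor`, `positive`, and `strongest` are hypothetical helpers. The same steps should apply to the `cor_matrix` computed above.

```r
# Toy data standing in for the real file; all values are made up for illustration.
toy <- data.frame(
  OverallWins    = c(10, 15, 20, 25, 18, 12),
  ConferenceWins = c(4, 7, 10, 13, 9, 5),
  SOS            = c(5.1, 6.3, 4.8, 7.2, 6.0, 5.5),
  PTS            = c(70, 68, 75, 72, 74, 69)
)

cor_matrix <- cor(toy, use = "complete.obs")

# Keep only the OverallWins column and drop its correlation with itself
wins_cor <- cor_matrix[, "OverallWins"]
wins_cor <- wins_cor[names(wins_cor) != "OverallWins"]

# Sort from the largest correlation to the smallest
wins_cor <- sort(wins_cor, decreasing = TRUE)

# Variables positively correlated with OverallWins
positive <- names(wins_cor)[wins_cor > 0]

# Variable with the largest correlation in absolute value
strongest <- names(wins_cor)[which.max(abs(wins_cor))]

print(round(wins_cor, 2))
print(positive)
print(strongest)
```

Using `abs()` when picking the strongest variable means a strongly negative correlation would also be considered; drop it if only positive relationships are of interest.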
Create a scatter plot of the overall wins and conference wins; use different colors or shapes to denote different conferences (ACC, Big Ten and Southern). (3 points)
Add a single trend line to the chart. (3 points)
Improve your chart to make it clear and ready for presenting to your readers. (3 points)
library(ggplot2)
ggplot(data, aes(x = OverallWins, y = ConferenceWins, color = Conf, shape = Conf)) +
  geom_point() + # Plot the individual data points
  geom_smooth(method = "lm", se = FALSE, aes(group = 1)) + # Add a single trend line for all data
  scale_color_manual(values = c("ACC" = "red", "Big Ten" = "blue", "Southern" = "green")) +
  scale_shape_manual(values = c("ACC" = 17, "Big Ten" = 15, "Southern" = 18)) +
  labs(
    title = "Relationship Between Overall Wins and Conference Wins",
    x = "Overall Wins",
    y = "Conference Wins",
    color = "Conference",
    shape = "Conference"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14), # Center and resize the plot title
    legend.position = "bottom", # Position the legend at the bottom
    plot.margin = margin(t = 30, r = 30, b = 30, l = 30) # Add margin space around the plot
  )
What pattern do you notice? (3 points)
Answer here:
The ACC (red triangles) and the Big Ten (blue squares) show more dispersed points, suggesting wider variation in the number of conference wins among seasons with similar overall wins.
The Southern Conference (green diamonds) appears to have fewer data points, suggesting that fewer seasons in the dataset were played in that conference or that those seasons cover a narrower range of overall wins.
The trend line (in blue) shows a positive relationship between overall wins and conference wins: as one goes up, so does the other. The line seems to fit the ACC and Big Ten points better than the Southern Conference points, which suggests a stronger correlation for the former two.
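The impression that the trend line fits some conferences better than others can be checked numerically by computing the correlation within each conference, along with the number of seasons in each. The sketch below is a minimal illustration using a made-up `toy` data frame (the values are assumptions, not the real UMD records); `n_by_conf` and `cor_by_conf` are hypothetical names. With the real data, `toy` would be replaced by the data frame loaded from the CSV.

```r
# Toy data standing in for the real file; all values are made up for illustration.
toy <- data.frame(
  Conf           = rep(c("ACC", "Big Ten", "Southern"), each = 4),
  OverallWins    = c(19, 25, 17, 27, 24, 20, 27, 17, 10, 14, 16, 8),
  ConferenceWins = c(8, 12, 7, 13, 12, 10, 14, 8, 5, 9, 7, 4)
)

# Number of seasons recorded in each conference
n_by_conf <- table(toy$Conf)

# Correlation between overall and conference wins, computed separately per conference
cor_by_conf <- sapply(
  split(toy, toy$Conf),
  function(d) cor(d$OverallWins, d$ConferenceWins)
)

print(n_by_conf)
print(round(cor_by_conf, 2))
```

A correlation based on only a handful of seasons is a noisy estimate, so the season counts are worth reading alongside the coefficients.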
Create a line chart for the time series of overall wins. (3 points)
Add a vertical line at x = 2010. (2 points)
Improve your chart to make it clear and ready for presenting to your readers. (3 points)
timeseries <- ggplot(data, aes(x = Year, y = OverallWins)) +
  geom_line() + # Add line geometry
  geom_vline(xintercept = 2010, linetype = "dashed", color = "red4") + # Add vertical line at x = 2010
  labs(title = "Time Series of Overall Wins",
       x = "Year",
       y = "Overall Wins") +
  theme_minimal() + # Use a minimal theme for clarity
  theme(plot.title = element_text(hjust = 0.5), # Center the plot title
        axis.text.x = element_text(angle = 45, hjust = 1))
timeseries
library(ggplot2)
library(gganimate)
# Assumes the timeseries plot object from the previous chunk already exists
# Add animation layers to the existing ggplot object
line_animated <- timeseries +
  labs(title = 'Year: {frame_along}', subtitle = 'Overall Wins') +
  transition_reveal(Year) + # Reveal the line progressively along Year
  ease_aes('linear') # Progress at a constant speed between frames
# Render the animation as a GIF (gifski_renderer() requires the gifski package)
animate(line_animated, fps = 8, width = 800, height = 600, renderer = gifski_renderer())
Create a stacked bar chart to show the number of seasons that each coach makes it to the NCAA tournament and the number of seasons they do not. (3 points)
Order the coaches based on their first year of serving as the coach at UMD. (2 points)
Improve your chart to make it clear and ready for presenting to your readers. (3 points)
library(ggplot2)
library(dplyr)
library(tidyr)
library(readr) # for read_csv
# Fix the NCAA Tournament column name if it contains periods instead of spaces
names(umd_data) <- gsub("\\.", " ", names(umd_data))
# Calculate NCAA appearances and non-appearances
umd_coach_ncaa <- umd_data %>%
  mutate(NCAA_Appearance = !is.na(`NCAA Tournament`)) %>%
  group_by(Coach) %>%
  summarise(Seasons_Made_NCAA = sum(NCAA_Appearance),
            Seasons_Not_Made_NCAA = n() - sum(NCAA_Appearance),
            First_Year = min(Year)) %>%
  ungroup() %>%
  arrange(First_Year)
# Reshape data for plotting
umd_coach_ncaa_long <- umd_coach_ncaa %>%
  select(Coach, Seasons_Made_NCAA, Seasons_Not_Made_NCAA, First_Year) %>%
  pivot_longer(cols = c("Seasons_Made_NCAA", "Seasons_Not_Made_NCAA"),
               names_to = "NCAA_Status", values_to = "Seasons")
# Create the stacked bar chart
ggplot(umd_coach_ncaa_long, aes(x = reorder(Coach, First_Year), y = Seasons, fill = NCAA_Status)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Seasons_Made_NCAA" = "blue4", "Seasons_Not_Made_NCAA" = "red4"),
                    labels = c("Seasons_Made_NCAA" = "Made NCAA", "Seasons_Not_Made_NCAA" = "Did Not Make NCAA")) +
  labs(title = "UMD Coaches' NCAA Tournament Seasons",
       x = "Coach (Ordered by First Year)",
       y = "Number of Seasons",
       fill = "NCAA Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        legend.title = element_text(face = "bold"),
        legend.position = "bottom")
# Determine the best coach in terms of the number of seasons made it to the NCAA
best_coach <- umd_coach_ncaa %>%
  filter(Seasons_Made_NCAA == max(Seasons_Made_NCAA)) %>%
  pull(Coach)
# Print the best coach
print(paste("The best coach in terms of the number of seasons made to the NCAA is:", best_coach))
## [1] "The best coach in terms of the number of seasons made to the NCAA is: Gary Williams"
Which coach is the best in terms of the number of seasons that he/she makes it to the NCAA? (2 points)
Answer here:
Gary Williams

| Column/Field | Definition |
| --- | --- |
| Season | NCAA Tournament appearance |
| Year | Season year |
| Conf | Conference |
| OverallWins | Wins |
| OverallLosses | Losses |
| OverallWinLossPercent | Win-Loss percentage |
| ConferenceWins | Conference Wins |
| ConferenceLosses | Conference Losses |
| ConferenceWinLossPercent | Conference Win-Loss percentage |
| SRS | Simple Rating System: A rating that takes into account average point differential and strength of schedule. The rating is denominated in points above/below average, where zero is average. |
| SOS | Strength of Schedule: A rating of strength of schedule. The rating is denominated in points above/below average, where zero is average. |
| PTS | Points Per Game |
| Opponents PTS | Opponent Points Per Game |
| AP Pre | Rank in pre-season AP poll |
| AP High | Highest rank in AP poll during the season |
| AP Final | Rank in final AP poll |
| NCAA Tournament | NCAA Tournament result |
| Seed | Seed number |
| Coach | Coach and their win – loss during the season |