An important element of data-driven decisions is the ability to visually communicate your data and interpretations. As “big data” and data analysis grow in both importance and complexity, understanding the basic principles of data visualization matters even more.
This assignment aligns with the second course objective: create visualizations using R.
The dataset UMDBasketball2022.csv contains season-by-season information on the Maryland Terrapins men's basketball program from 1923 to 2022. The data was originally scraped from https://www.sports-reference.com/cbb/schools/maryland/men/. In this assignment, we will use this dataset to study the team’s overall wins and coaches’ performance. A data dictionary can be found at the end of this document.
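Several columns in the data dictionary have a space in their name (for example `Opponents PTS` and `NCAA Tournament`), and how the file is read decides whether those names survive. The sketch below uses a small made-up CSV string, not the real file, to show how base R's `read.csv` treats such headers and blank cells. The toy values and the object names `csv_text`, `mangled`, and `kept` are assumptions for illustration only.

```r
# Toy CSV text standing in for UMDBasketball2022.csv (values are made up).
csv_text <- "Year,OverallWins,Opponents PTS,NCAA Tournament
2021,17,68.5,Lost Second Round
2022,15,70.1,"

# Default check.names = TRUE rewrites headers into syntactic R names,
# so the space becomes a dot.
mangled <- read.csv(text = csv_text)
names(mangled)

# check.names = FALSE keeps the headers exactly as written in the file.
# na.strings = c("", "NA") turns blank cells into NA, which matters for
# any later is.na() test on the NCAA Tournament column.
kept <- read.csv(text = csv_text, check.names = FALSE,
                 na.strings = c("", "NA"))
names(kept)
is.na(kept[["NCAA Tournament"]])
```

If the real file is read with `check.names = FALSE` (or with `readr::read_csv`, which also keeps headers as written), the column can be referenced as `"Opponents PTS"` or with backticks.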
Make sure to change the axis titles, chart titles, colors, size, font, legend locations etc. if needed. Categories should be in a meaningful order, if appropriate. Also, format the grid lines and data series appropriately to enhance the clarity of the graph. Remember to write an informative title with some insights (except Q4 d).
You must turn in a well-formatted HTML output file. To do so, please click on the Knit button at the top with the wool ball icon. Make sure to include both the code chunks and their corresponding output charts in the generated output file.
Create a boxplot that examines the distribution of overall wins. (2 points)
Add points of overall wins using the geom_jitter function. (2 points)
Add a title to describe your main finding. (2 points)
Improve your chart to make it clear and ready for presenting to your readers. (2 points)
(You only need to present a single chart with all the required information mentioned above)
library(ggplot2)
ggplot(umd_data, aes(x = factor(1), y = OverallWins)) +
  geom_boxplot(outlier.shape = NA) +  # hide outlier dots so jittered points are not drawn twice
  geom_jitter(aes(color = Conf), width = 0.2) +
  labs(title = "Distribution of Overall Wins per Season for UMD Basketball, 1923-2022",
       x = "",
       y = "Overall Wins") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_blank(),  # the single category label carries no information
        legend.position = "bottom") +
  scale_color_manual(name = "Conference",
                     values = c("ACC" = "black", "Big Ten" = "green", "Southern" = "red"))
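A title that states a finding is easier to write with the numbers behind the box in hand. The following is a minimal sketch on a made-up vector of win totals (`wins` is hypothetical and would be replaced by `umd_data$OverallWins`). It uses `quantile()`, which is also how ggplot2's boxplot statistics are computed, as far as I am aware.

```r
# Hypothetical win totals; swap in umd_data$OverallWins for the real data.
wins <- c(8, 12, 15, 17, 19, 21, 24, 25, 27, 32)

# Minimum, quartiles, and maximum: the values the box and whiskers summarize.
q <- quantile(wins, probs = c(0, 0.25, 0.5, 0.75, 1))
q

# Interquartile range: the height of the box.
box_height <- IQR(wins)

# Seasons beyond 1.5 * IQR from the box would be flagged as outliers.
lower_fence <- q[["25%"]] - 1.5 * box_height
upper_fence <- q[["75%"]] + 1.5 * box_height
outliers <- wins[wins < lower_fence | wins > upper_fence]
outliers
```

The median and the box limits from this summary can then be quoted directly in the chart title or subtitle.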
Create a correlations heat map for the following variables: OverallWins, ConferenceWins, SRS, SOS, PTS, and Opponents PTS. (3 points)
Improve your chart to make it clear and ready for presenting to your readers. (3 points)
library(corrplot)
selected_data <- umd_data[, c("OverallWins", "ConferenceWins", "SRS", "SOS", "PTS", "Opponents PTS")]
cor_matrix <- cor(selected_data, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.7,
         title = "Correlation Heatmap of Basketball Performance",
         mar = c(0, 0, 1, 0))
Which variables are positively correlated with overall wins? Which variable is most correlated with overall wins? (2 points)
Answer here:
Variables that are positively correlated with overall wins:
Conference wins, with a correlation of 0.76
SRS, with a correlation of 0.8
SOS, with a correlation of 0.45
PTS, with a correlation of 0.48
The variable most correlated with overall wins is SRS (Simple Rating System), with a correlation of 0.8, which suggests it is a good indicator of overall winning performance.
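Reading the strongest correlation off a heat map by eye is error-prone, so it can help to rank the values programmatically. The sketch below does this on a small made-up data frame (`toy` and its numbers are hypothetical); with the real data, the same steps would apply to the `cor_matrix` computed above.

```r
# Made-up seasons; these numbers are for illustration only.
toy <- data.frame(
  OverallWins = c(10, 14, 17, 19, 22, 25, 27),
  SRS         = c(-2.1, 3.0, 6.5, 9.8, 12.4, 16.0, 19.3),
  PTS         = c(70, 66, 74, 69, 77, 72, 78)
)

cor_matrix <- cor(toy, use = "complete.obs")

# Take the OverallWins row and drop its correlation with itself (always 1).
wins_cor <- cor_matrix["OverallWins", ]
wins_cor <- wins_cor[names(wins_cor) != "OverallWins"]

# Rank from most positive to most negative.
sort(wins_cor, decreasing = TRUE)

# Name of the variable with the largest positive correlation.
strongest <- names(which.max(wins_cor))
strongest
```

If negative correlations should count as "most correlated" too, `which.max(abs(wins_cor))` would be the variant to use.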
Create a scatter plot of the overall wins and conference wins; use different colors or shapes to denote different conferences (ACC, Big Ten and Southern). (3 points)
Add a single trend line to the chart. (3 points)
Improve your chart to make it clear and ready for presenting to your readers. (3 points)
ggplot(umd_data, aes(x = ConferenceWins, y = OverallWins)) +
  geom_point(aes(color = Conf, shape = Conf)) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_manual(values = c("ACC" = "purple", "Big Ten" = "red", "Southern" = "brown")) +
  labs(title = "Relationship Between Overall Wins and Conference Wins by Conference",
       x = "Conference Wins",
       y = "Overall Wins",
       color = "Conference",
       shape = "Conference") +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5))
What pattern do you notice? (3 points)
Answer here:
The scatter plot shows a positive correlation between conference wins and overall wins: seasons with more conference wins tend to have more overall wins. The pattern holds across all three conferences represented in the data (ACC, Big Ten, and Southern).
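The visual pattern can be backed with two numbers: the correlation coefficient and the slope of the same linear fit that `geom_smooth(method = "lm")` draws. This is a minimal sketch on made-up values; `toy` is a hypothetical stand-in for `umd_data`.

```r
# Made-up seasons; replace toy with umd_data for the real analysis.
toy <- data.frame(
  ConferenceWins = c(3, 5, 7, 8, 10, 12, 14),
  OverallWins    = c(10, 14, 17, 19, 22, 25, 27)
)

# Pearson correlation: strength and direction of the linear relationship.
r <- cor(toy$ConferenceWins, toy$OverallWins, use = "complete.obs")
r

# Linear fit; the slope is the expected change in overall wins
# for each additional conference win.
fit <- lm(OverallWins ~ ConferenceWins, data = toy)
slope <- coef(fit)[["ConferenceWins"]]
slope
```

Quoting the slope alongside the chart makes the "rise together" statement concrete.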
Create a line chart for the time series of overall wins. (3 points)
Add a vertical line at x = 2010. (2 points)
Improve your chart to make it clear and ready for presenting to your readers. (3 points)
plot <- ggplot(umd_data, aes(x = Year, y = OverallWins)) +
  geom_line() + # Add line geometry
  geom_vline(xintercept = 2010, linetype = "dashed", color = "red") + # Add vertical line at x = 2010
  labs(title = "Time Series of Overall Wins",
       x = "Year",
       y = "Overall Wins") +
  theme_minimal() + # minimal theme for clarity
  theme(plot.title = element_text(hjust = 0.5), # Center the plot title
        axis.text.x = element_text(angle = 45, hjust = 1))
plot
library(ggplot2)
library(gganimate)
line_animated <- plot +
  labs(title = 'Year: {frame_along}', subtitle = 'Overall Wins') +
  transition_reveal(Year) +
  ease_aes('linear') # the easing function is passed as the first (default) argument
animate(line_animated, fps = 8)
Create a stacked bar chart to show the number of seasons that each coach makes it to the NCAA tournament and the number of seasons they do not. (3 points)
Order the coaches based on their first year of serving as the coach at UMD. (2 points)
Improve your chart to make it clear and ready for presenting to your readers. (3 points)
library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
names(umd_data) <- gsub("\\.", " ", names(umd_data)) # restore spaces if the names were read with dots
umd_coach_ncaa <- umd_data %>%
  mutate(NCAA_Appearance = !is.na(`NCAA Tournament`)) %>% # assumes seasons without an appearance are NA
  group_by(Coach) %>%
  summarise(Seasons_Made_NCAA = sum(NCAA_Appearance),
            Seasons_Not_Made_NCAA = n() - sum(NCAA_Appearance),
            First_Year = min(Year)) %>%
  ungroup() %>%
  arrange(First_Year)
umd_coach_ncaa_long <- umd_coach_ncaa %>%
  select(Coach, Seasons_Made_NCAA, Seasons_Not_Made_NCAA, First_Year) %>%
  pivot_longer(cols = c("Seasons_Made_NCAA", "Seasons_Not_Made_NCAA"),
               names_to = "NCAA_Status", values_to = "Seasons")
ggplot(umd_coach_ncaa_long, aes(x = reorder(Coach, First_Year), y = Seasons, fill = NCAA_Status)) +
  geom_col() + # equivalent to geom_bar(stat = "identity")
  scale_fill_manual(values = c("Seasons_Made_NCAA" = "brown", "Seasons_Not_Made_NCAA" = "pink4"),
                    labels = c("Seasons_Made_NCAA" = "Made NCAA", "Seasons_Not_Made_NCAA" = "Did Not Make NCAA")) +
  labs(title = "UMD Coaches' NCAA Tournament Seasons",
       x = "Coach (Ordered by First Year)",
       y = "Number of Seasons",
       fill = "NCAA Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        legend.title = element_text(face = "bold"),
        legend.position = "bottom")
best_coach <- umd_coach_ncaa %>%
  filter(Seasons_Made_NCAA == max(Seasons_Made_NCAA)) %>%
  pull(Coach)
best_coach # print the coach (or coaches, if tied) with the most NCAA seasons
Which coach is the best in terms of the number of seasons that he/she makes it to the NCAA? (2 points)
Answer here:
Gary Williams is the best coach in terms of the number of seasons in which he made it to the NCAA tournament.
| Column/Field | Definition |
| --- | --- |
| Season | NCAA Tournament appearance |
| Year | Season year |
| Conf | Conference |
| OverallWins | Wins |
| OverallLosses | Losses |
| OverallWinLossPercent | Win-Loss percentage |
| ConferenceWins | Conference Wins |
| ConferenceLosses | Conference Losses |
| ConferenceWinLossPercent | Conference Win-Loss percentage |
| SRS | Simple Rating System: A rating that takes into account average point differential and strength of schedule. The rating is denominated in points above/below average, where zero is average. |
| SOS | Strength of Schedule: A rating of strength of schedule. The rating is denominated in points above/below average, where zero is average. |
| PTS | Points Per Game |
| Opponents PTS | Opponent Points Per Game |
| AP Pre | Rank in pre-season AP poll |
| AP High | Highest rank in AP poll during the season |
| AP Final | Rank in final AP poll |
| NCAA Tournament | NCAA Tournament result |
| Seed | Seed number |
| Coach | Coach and their win-loss record for the season |
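The dictionary's wording for `Coach` suggests the column may hold the coach's name together with that season's record. If so, grouping by the raw column would treat every season as a different coach, so the name would need to be separated first. The sketch below assumes a `Name (W-L)` layout, which is a guess about the file rather than something confirmed here. The values are made up, and seasons listing more than one coach are not handled.

```r
# Made-up values in the assumed "Name (W-L)" format; not real records.
coach_raw <- c("Coach A (25-8)", "Coach A (19-13)", "Coach B (27-6)")

# Drop everything from the opening parenthesis onward, then trim spaces.
coach_name <- trimws(sub("\\(.*$", "", coach_raw))

# Pull the wins and losses out of the parentheses as integers.
record <- sub("^.*\\(([0-9]+)-([0-9]+)\\).*$", "\\1 \\2", coach_raw)
coach_wins <- as.integer(sub(" .*$", "", record))
coach_losses <- as.integer(sub("^.* ", "", record))

# Seasons per coach now group by name rather than by name plus record.
table(coach_name)
```

With the real data, the cleaned name could be stored back into `umd_data$Coach` before any per-coach summary is computed.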