Syllabus for Stat 360 (Data Visualization)

General Information

Course Instructor: Dr. Shiju Zhang
Meeting Dates and Time: None
Classroom: Ch/16
Instructor’s E-Mail: szhang@stcloudstate.edu
Office Hours: TH 11:00 am - 12:15 pm or by appointment via email
Zoom: https://minnstate.zoom.us/j/96907848410 (Passcode: 3.14)
Textbook: None

Course Description

Explore visual representations of data for exploratory analysis. Traditional and contemporary visual techniques to improve the understanding and communication of complex data. Good design practices for visualization and presentation of analytics. Extensive use of software.

Course Objectives

By the end of the course, students should be able to:

Recommend, construct, and interpret appropriate visualizations for various types of data.
Evaluate appropriate analysis techniques using visual representations.
Organize and communicate complex information concisely using data visualization.

Assessment & Grading

Student grades will be determined by a combination of five projects. Feel free to leverage ChatGPT’s capabilities, but it’s important to familiarize yourself with effective methods of interacting with the bot.

Appendix I contains five project descriptions.
For guidance on crafting a project report, refer to Appendix II, where you’ll find an example of detailed instructions and an exemplary report.

The grading scale is as follows:

A = 90%; B = 80%; C = 70%; D = 60%; F = below 60%

Course Outline

Week 1: Fundamental R programming concepts and data management techniques, including importing, filtering, selecting, aggregating, and mutating data.
Weeks 2-3: Visualization tools in base R, followed by more advanced visualization capabilities in ggplot2.
Weeks 4-5: Interactive visualization packages: plotly and leaflet.
Weeks 6-16: Working on projects.
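
As a rough preview of the Week 1 operations, the sketch below filters, selects, mutates, and aggregates the built-in mtcars data using base R only. The dataset choice and the derived column kpl are illustrative assumptions rather than course materials; the dplyr verbs named in the comments are the usual equivalents.

```r
# Illustrative only: mtcars ships with R. Real data would usually be imported,
# e.g. with read.csv("my_file.csv") (hypothetical file name).
cars <- mtcars

# Filter rows: keep cars with more than 100 horsepower (dplyr: filter())
powerful <- cars[cars$hp > 100, ]

# Select columns (dplyr: select())
powerful <- powerful[, c("mpg", "cyl", "hp")]

# Mutate: add kilometres per litre as a derived column (dplyr: mutate())
powerful$kpl <- powerful$mpg * 0.425144

# Aggregate: mean kpl per cylinder count (dplyr: group_by() + summarise())
kpl_by_cyl <- aggregate(kpl ~ cyl, data = powerful, FUN = mean)
print(kpl_by_cyl)
```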

Let me provide you with an overview of how the intended visualizations will appear once constructed.

Plot w/ base R

Plot w/ ggplot2

Map w/ plotly

Map w/ leaflet

Appendix I: Projects


Project 1. Based on the world happiness data (2017-2020) here https://en.wikipedia.org/wiki/World_Happiness_Report, explore the following 5 questions:

What are the top 10 happiest countries in the most recent year available? Create a barplot arranged in decreasing order by height.
How has the happiness score changed over the 4 years for China, India, USA, Indonesia, Japan, and Russia? Create one visual with legends.
Which countries have consistently ranked as the top 10 happiest countries across the 4 years?
Is there a correlation between GDP per capita and happiness score among countries? Does it differ by year? Create a scatter plot with GDP per capita on the x-axis, happiness score on the y-axis, and year as a legend title, with each point representing a country. You might need to merge the 4 different datasets. There is an R package “sqldf” and you can use the sqldf() function for this purpose. Ask the instructor or ChatGPT or someone you trust for help.
Based on the 2020 data, how does the distribution of happiness scores vary across different regions of the world? Create box plots or violin plots displaying the distribution of happiness scores for each region, helping to compare their medians, quartiles, and outliers. The hard part is to define the regions. Luckily, you can use the data “GNI2014” from package treemap to find a clue.
Based on the 2020 data, what is the relationship between social support and life expectancy in terms of their impact on happiness? Create a bubble plot with life expectancy on the x-axis, social support on the y-axis, and bubble size representing happiness score. This can help explore how social support and life expectancy jointly contribute to happiness.
Create a treemap for year 2020, using color index to display the countries in order of numerical ranking.
Create a choropleth map colored by the happiness score for different regions. You may make use of the following code with changes:

# Load packages ggplot2 and maps
library(ggplot2)
library(maps)

# Create a fictional dataset with regions and happiness scores.
# The names must match the "region" column of the world map, which holds
# country names such as "USA" rather than continents.
region_data <- data.frame(
  Region = c("Finland", "China", "USA", "Brazil", "Nigeria", "Australia"),
  HappinessScore = c(7.5, 6.2, 7.0, 6.5, 5.8, 7.3)
)

# Get world map data, a data frame with 99338 rows and 6 columns, one of which is "region"
world_map <- map_data("world")

# Merge map data with region data
merged_data <- merge(world_map, region_data, by.x = "region", by.y = "Region", all.x = TRUE)

# merge() reorders the rows, so restore the polygon drawing order
merged_data <- merged_data[order(merged_data$group, merged_data$order), ]

# Create a choropleth map colored by happiness score
ggplot(merged_data, aes(x = long, y = lat, group = group, fill = HappinessScore)) +
  geom_polygon() +
  coord_fixed(ratio = 1.25) +
  scale_fill_gradient(low = "red", high = "green", name = "Happiness Score") +
  labs(title = "Choropleth Map of Happiness Scores by Region")

In this example, the fictional dataset region_data contains regions and their corresponding happiness scores. The region names must match the values in the "region" column of the map data, which are country names; regions without a score are drawn in grey.

The steps are: load the ggplot2 and maps packages; obtain world map data using map_data(); merge the map data with region_data using merge(); restore the polygon order; and create the choropleth map using ggplot() and geom_polygon().
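
For the GDP question in Project 1, the yearly tables have to be stacked into one data frame before plotting. The sketch below shows one possible way to do that with sqldf(), using two tiny tables with made-up values. The table names y2019 and y2020 and the columns country, gdp, and score are hypothetical; the real columns depend on how you import the data. If sqldf is not installed, the sketch falls back to base R.

```r
# Toy yearly tables with made-up values (NOT real World Happiness Report data)
y2019 <- data.frame(country = c("A", "B"), gdp = c(1.2, 0.8), score = c(7.1, 5.4))
y2020 <- data.frame(country = c("A", "B"), gdp = c(1.3, 0.9), score = c(7.0, 5.6))

if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)
  # UNION ALL stacks the rows; the literal year becomes a new column
  stacked <- sqldf("
    SELECT 2019 AS year, country, gdp, score FROM y2019
    UNION ALL
    SELECT 2020 AS year, country, gdp, score FROM y2020
  ")
} else {
  # Base R fallback when sqldf is not installed
  stacked <- rbind(cbind(year = 2019, y2019), cbind(year = 2020, y2020))
}

# Scatter plot with one color per year and the year as the legend title
year_f <- factor(stacked$year)
plot(stacked$gdp, stacked$score, col = as.integer(year_f), pch = 19,
     xlab = "GDP per capita", ylab = "Happiness score")
legend("topleft", legend = levels(year_f), col = seq_along(levels(year_f)),
       pch = 19, title = "Year")
```

The same pattern extends to four years by adding two more SELECT statements.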

Project 2. Analyzing Customer Behavior:

E-commerce data containing customer behavior are here: https://www.kaggle.com/datasets/uom190346a/e-commerce-customer-behavior-dataset

Questions you can explore:

What is the distribution of total spend across different membership types? In other words, how does total spend vary by membership type? Hint: Box plot or violin plot
Is there a relationship between the number of items purchased and total spend?
How do average ratings differ by city? Hint: Bar chart or box plot showing average rating by city
What is the correlation between days since last purchase and total spend?
What is the impact of discount application on total spend?
What is the age distribution of customers for each membership type?
How does customer satisfaction level correlate with total spend?
What are the spending patterns of different gender groups?
How does the number of items purchased vary by city?
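
As a starting point for the first two questions, here is a minimal sketch on simulated data. The data frame customers and its columns membership, items, and total_spend are hypothetical stand-ins; the column names in the Kaggle file may differ and should be checked after import.

```r
# Simulated data for illustration only (not the Kaggle dataset)
set.seed(1)
customers <- data.frame(
  membership = rep(c("Bronze", "Silver", "Gold"), each = 20),
  items = sample(5:20, 60, replace = TRUE)
)
customers$total_spend <- 50 * customers$items + rnorm(60, sd = 40)

# Distribution of total spend for each membership type
boxplot(total_spend ~ membership, data = customers,
        xlab = "Membership type", ylab = "Total spend")

# Strength of the linear relationship between items purchased and total spend
r <- cor(customers$items, customers$total_spend)
print(r)
```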

Project 3. Sports Analytics:

All NBA players data (seasons 1996-2021) are here: https://www.kaggle.com/datasets/justinas/nba-players-data

Use data for seasons from 2005 to 2021. Choose 5 of the following questions to explore:

Career Longevity: What is the average career length of NBA players during this period? Are there differences in career lengths between players in different positions?
Age and Performance: How does the average performance (points, assists, rebounds) change with players’ age? Are there players who continued to excel at an older age?
Scoring Trends: Who were the top scorers each season, and how did their points per game vary? Is there a trend in scoring across seasons?
Three-Point Evolution: How has the average number of three-point attempts per game changed over the years? Are there players who significantly contributed to the three-point revolution?
Rookie Impact: What is the average performance of rookie players compared to players in their prime years? Are there rookies who had an exceptional impact on their teams?
Positional Analysis: Which positions (guard, forward, center) have the highest average points, rebounds, and assists per game? How have position-based performance trends changed over time?
Player Efficiency: What is the distribution of player efficiency ratings (PER) across seasons? Can you identify any outliers who consistently maintain a high PER? What about using net rating instead of PER for assessing player efficiency?
Contract Analysis: How has the average salary of NBA players changed over the years? Are there players who are highly paid but have relatively low performance?
All-Star Selections: Which players were selected as All-Stars in the most seasons during this period? Is there a correlation between All-Star selections and performance metrics?
International Impact: How has the representation of international players in the NBA evolved over time? Are there any international players who significantly impacted the league?

For questions that cannot be answered from the given data, use Google to find webpages with the relevant information. You should answer at least 5 questions for this project.
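
The sketch below illustrates one way to restrict the data to seasons 2005 to 2021 and summarize scoring by age. It assumes the season is stored as text such as "2005-06" and that the columns are named season, age, and pts; verify both assumptions against the downloaded file. The values are made up.

```r
# Made-up rows for illustration (not real NBA statistics)
players <- data.frame(
  season = c("2003-04", "2005-06", "2005-06", "2010-11", "2020-21"),
  age = c(25, 22, 30, 22, 30),
  pts = c(10, 8, 20, 12, 16)
)

# Extract the starting year of each season as an integer
players$start_year <- as.integer(substr(players$season, 1, 4))

# Keep only seasons starting in 2005 through 2021
recent <- players[players$start_year >= 2005 & players$start_year <= 2021, ]

# Average points per game at each age
pts_by_age <- aggregate(pts ~ age, data = recent, FUN = mean)

plot(pts_by_age$age, pts_by_age$pts, type = "b",
     xlab = "Age", ylab = "Average points per game")
```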

Project 4. New York City police stop-frisk data:

Data source: https://www.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page

With the 2022 data, answer the following questions:

Demographic Analysis: What is the racial and gender distribution of individuals who were stopped and frisked by the police in 2022? Are there any disparities in stop-and-frisk encounters based on race or gender?
Temporal Trends: How has the number of stop-and-frisk encounters changed over the months of 2022? Are there specific days of the week or times of day when stop-and-frisk incidents are more common?
Reasons for Stops: What are the most common reasons cited by police officers for conducting stop-and-frisk encounters? Is there a relationship between the stated reason for the stop and the outcome of the encounter?
Outcomes of Stops: What percentage of stop-and-frisk encounters result in arrests, summonses, or other outcomes? Are there differences in outcomes based on the demographics of the individuals stopped?
Location Analysis: Which neighborhoods or precincts have the highest number of stop-and-frisk incidents? Is there a correlation between the location of stops and the demographic characteristics of the population?
Weapon and Contraband Discovery: How frequently do police officers find weapons or contraband during stop-and-frisk encounters? Is there a connection between weapon or contraband discovery and the reasons for stops?
Repeat Encounters: How many individuals have been stopped and frisked multiple times in 2022? Are there any patterns in the demographics or outcomes of repeat encounters?
Age and Stop Frequency: Is there a relationship between the age of individuals and their likelihood of being stopped and frisked? How do the reasons for stops differ across age groups?
Officer Identification: Which police officers have conducted the highest number of stop-and-frisk encounters? Are there variations in outcomes based on the officers involved?
Comparative Analysis: How does the stop-and-frisk data from 2022 compare to previous years in terms of overall numbers and demographic distributions? Have there been any changes in stop-and-frisk practices or outcomes over time?
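
For the temporal trends question, a monthly count can be sketched as follows. The data frame stops and its column stop_date are hypothetical, and the dates are invented; the actual NYPD file uses its own column names and date format.

```r
# Invented dates for illustration (not real NYPD records)
stops <- data.frame(
  stop_date = as.Date(c("2022-01-05", "2022-01-20", "2022-02-11",
                        "2022-03-02", "2022-03-15", "2022-03-30"))
)

# Month as a factor with all 12 levels, so months without stops show as zero
stops$month <- factor(format(stops$stop_date, "%m"),
                      levels = sprintf("%02d", 1:12))

# Number of stops in each month
monthly <- table(stops$month)

barplot(monthly, xlab = "Month of 2022", ylab = "Number of stops")
```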

Project 5. Analyzing Olympic Medal Data:

Olympic Games medal data for various years and sports are given here: https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results

Here are questions to explore:

How has the distribution of gold, silver, and bronze medals changed over the years? Visualize using stacked bar charts.
Which countries have historically dominated the Olympics in terms of total medals won? Create a bar plot.
Compare the medal counts for different sports. Which sports tend to have the most medals awarded?
Visualize the age distribution of athletes participating in the Olympics. Are there any trends over time?
Create a heatmap to show which countries perform well in specific sports. Are there any patterns?
Track the gender distribution of athletes over the years. Has there been an increase in female participation?
Plot the winning trends of a specific country’s Olympic performance over multiple editions.
Analyze the host country’s performance in the Olympics they hosted. How does it compare to their performance in other editions?
Other questions you would like to explore?
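
A stacked bar chart of medal counts by year can be sketched in base R as below. The data frame medals and its columns year and medal are hypothetical, and the rows are invented; in the Kaggle file, athletes without a medal would need to be filtered out first.

```r
# Invented medal records for illustration (not real Olympic results)
medals <- data.frame(
  year = c(2012, 2012, 2012, 2016, 2016, 2016, 2016),
  medal = c("Gold", "Silver", "Bronze", "Gold", "Gold", "Silver", "Bronze")
)

# Fix the stacking order: Gold at the bottom, then Silver, then Bronze
medals$medal <- factor(medals$medal, levels = c("Gold", "Silver", "Bronze"))

# Rows are medal types, columns are years
counts <- table(medals$medal, medals$year)

# A matrix passed to barplot() is drawn as stacked bars, one bar per column
barplot(counts, col = c("gold", "grey70", "sienna"), legend.text = TRUE,
        xlab = "Year", ylab = "Number of medals")
```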

Appendix II: Writing a project report


Typically, a project report should include several key sections that provide an organized and comprehensive overview of the project, its goals, methodologies, findings, and conclusions. Here’s a general outline of how a student project report could be structured:

Title Page: Informative project title, Student’s name, Course/instructor details, Date of submission
Abstract: A brief summary of the project’s objectives, methods, main findings, and conclusions.
Introduction: Introduction to the project’s topic and significance, Clear statement of the project’s objectives and research questions.
Literature Review (Not required): Overview of relevant literature or existing work related to the project’s topic, Discussion of previous research, methodologies, and findings that inform the current project.
Methodology: Description of the data sources and datasets used, Explanation of the tools and programming languages used, Details about the data preprocessing steps (such as cleaning, formatting, and transformations), Explanation of the chosen data visualization techniques and why they were selected.
Results: Presentation of the visualizations created during the project. Interpretation of each visualization’s insights and implications, Addressing the research questions and objectives using the visualized data.
Discussion (Optional): Analysis and interpretation of the findings in the context of the project’s objectives, Comparison of the results with existing literature or expectations, Discussion of any challenges encountered during data visualization or analysis.
Conclusion: Recap of the main findings and insights, Implications of the findings and their relevance to the broader context, Suggestions for future research or improvements to the methodology.
References: List of all sources cited in the report, including datasets, research papers, and online resources.
Appendices: Additional information that supports the main report (e.g., code snippets, additional visualizations, detailed data descriptions), Any supplementary materials that provide further context or insights.

An example of a project report

Project Title: Exploring Movie Ratings and Revenues

Student: [Your Name]

Course: Introductory Data Visualization

Instructor: [Instructor’s Name]

Date: [Date of Submission]

Abstract: This project explores a dataset containing information about movie ratings and revenues. Using R programming, we analyze and visualize trends and relationships between movie ratings, revenues, and other variables. The analysis provides insights into the factors influencing a movie’s success in the industry.

Introduction: Movies are a significant part of our entertainment industry, with factors like ratings, genres, and marketing strategies affecting their commercial performance. This project aims to leverage R programming to analyze and visualize a movie dataset, uncovering patterns and correlations that contribute to a movie’s financial success.

Literature Review: Prior research has shown that higher-rated movies tend to attract larger audiences and generate higher revenues. Moreover, specific genres and release timings can have a considerable impact on a movie’s box office performance.

Methodology:

Data Source: IMDb and Box Office Mojo datasets.
Tools: R programming language, dplyr, ggplot2.
Data Preprocessing: Merging and cleaning datasets, handling missing values.
Data Visualization: Creating scatter plots, bar plots, and correlation matrices using ggplot2.

Results:

Scatter Plot: Movie Ratings vs. Revenues

Visualizes the relationship between movie ratings and box office revenues. Observes that movies with higher ratings tend to have higher revenues, though exceptions exist.

Bar Plot: Average Ratings by Genre

Compares average ratings across different movie genres. Indicates genres that generally receive higher ratings.

Scatter Plot: Revenues vs. Budget

Explores the relationship between production budget and box office revenues. Identifies that higher budget movies do not necessarily result in higher revenues.

Correlation Matrix

Quantifies correlations between ratings, revenues, and budget. Highlights the strengths and directions of relationships.

Discussion:

The scatter plot confirms a positive correlation between ratings and revenues, aligning with previous findings. Comedy and drama genres tend to receive higher ratings on average, while action movies show mixed ratings. The scatter plot of revenues vs. budget implies that a high budget doesn’t always ensure box office success, underscoring the influence of other factors. The correlation matrix reveals strong positive correlations between ratings and revenues, as well as between budget and revenues.

Conclusion:

This project showcases the importance of movie ratings and their role in driving revenues. Additionally, it emphasizes the necessity of considering various factors, including genres and budget, when predicting a movie’s financial performance. The insights gained can aid filmmakers, studios, and investors in making informed decisions.

References:

IMDb Dataset: [Link]
Box Office Mojo Dataset: [Link]
Relevant research articles and resources.

Appendices:

Code snippets for data preprocessing and visualization. Additional scatter plots and charts exploring different facets of the dataset.