Data Wrangling and Visualization with R (Course 16, Hanoi - November 2024)
R Data Science Series
Course Introduction
This course provides an intensive, hands-on introduction to Data Wrangling and Data Visualization with the R programming language. You will learn the fundamental skills required to acquire, munge, transform, manipulate, and visualize data in a computing environment that fosters reproducibility.
Objectives
Manage different types of data (logical, numeric, text, integer).
Manage different data structures and data formats.
Export, reshape and transform your data.
Use R programming for data processing, data visualization and data science.
Import data into R from a variety of sources (Excel, SPSS, Stata, or the web).
Understand the basic principles behind effective data visualization.
Know why some graphs and figures work well, while others may fail to inform or actively mislead.
Know how to refine plots for effective presentation.
Know how to create a wide range of plots in R.
Know how to save, report and communicate your results.
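As a rough illustration of the first two objectives, the sketch below shows the four data types named above and a few common R data structures. It uses base R only; all variable names and values are made up for illustration and are not taken from the course material.

```r
# The four data types named in the objectives (values are illustrative)
flag  <- TRUE      # logical
price <- 19.5      # numeric (double)
name  <- "Hanoi"   # text (character)
count <- 3L        # integer (the L suffix marks an integer literal)

types <- c(class(flag), class(price), class(name), class(count))
print(types)

# Common data structures built from those types
scores   <- c(7.5, 8, 9.25)                     # atomic vector
grid     <- matrix(1:6, nrow = 2)               # matrix with 2 rows, 3 columns
record   <- list(city = name, year = 2024L)     # list mixing types
students <- data.frame(id = 1:3, score = scores) # data frame (one row per case)

str(students)
```

Calling class() or str() on an object is a quick way to check what type or structure you are working with before transforming it.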
Some Key Definitions
Data Wrangling:
Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time. Data wrangling is increasingly ubiquitous at today’s top firms. Data has become more diverse and unstructured, demanding increased time spent culling, cleaning, and organizing data ahead of broader analysis.
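The three steps in this definition (cleaning, structuring, enriching) can be sketched in a few lines of base R. This is a minimal example under stated assumptions: the data frame raw, its columns, and the income threshold of 1000 are all hypothetical and chosen only to show the idea.

```r
# Made-up raw records for illustration only (not course data)
raw <- data.frame(
  province = c(" Hanoi", "Da Nang ", "Hue", NA),
  income   = c("1,200", "950", "n/a", "700"),
  stringsAsFactors = FALSE
)

clean <- raw

# Cleaning: strip stray whitespace and turn text like "1,200" into numbers.
# Values that cannot be parsed (such as "n/a") become NA.
clean$province <- trimws(clean$province)
clean$income   <- suppressWarnings(as.numeric(gsub(",", "", clean$income)))

# Structuring: keep only the rows that have no missing values
clean <- clean[complete.cases(clean), ]

# Enriching: derive a new column from an existing one (threshold is arbitrary)
clean$income_level <- ifelse(clean$income >= 1000, "high", "low")

print(clean)
```

Dropping incomplete rows is only one possible choice; depending on the analysis, missing values might instead be kept or imputed.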
Data Visualization:
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
Why Is Data Wrangling Necessary?
In spite of advances in technologies for working with data, analysts still spend an inordinate amount of time obtaining data, diagnosing data quality issues and pre-processing data into a usable form. Research has shown that this portion of the data analysis process is the most tedious and time-consuming component. According to an article in The New York Times:
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
However, learning how to wrangle your data does not necessarily follow a linear progression. In fact, you need to start from scratch to understand how to work with data in R. Consequently, this course takes a meandering route through the data wrangling process to help build a solid data wrangling foundation.
Why Is Data Visualization Necessary?
The concept of using visualization to understand data has been around for centuries, from maps and graphs in the 17th century to the invention of the pie chart in the early 1800s. Several decades later, one of the most cited examples of statistical graphics appeared when Charles Minard mapped Napoleon’s invasion of Russia. His map depicts the size of the army as well as the path of Napoleon’s retreat from Moscow, and ties that information to temperature and time scales for a more in-depth understanding of the event. The effective use of graphs and charts is an important way to explore data for yourself and to communicate your ideas and results to others. Being able to produce effective plots from data is also the best way to develop an eye for reading and understanding visualizations made by others, whether presented in academia, business, policy, or the media.
There are several reasons for the popularity of data visualization and its application in the real world.
First and foremost, data visualization is important for a simple psychological reason: we are wired for visuals. By commonly cited estimates, about half of the brain is dedicated to visual functions, and about 90% of the information transmitted to the brain is visual. Because of the way the human brain processes information, using charts or graphs to visualize large amounts of complex data is easier than poring over spreadsheets or reports.
Second, data visualization is a quick, easy way to convey concepts in a universal manner, and you can experiment with different scenarios by making slight adjustments. According to John Tukey, a statistician at Princeton University:
The simple graph has brought more information to the data analyst’s mind than any other device.
Third, data visualization is one of the most important stages of any data science project. Visually displayed data is much more accessible, which is critical for promptly identifying the weaknesses of an organization, accurately forecasting trading volumes and sale prices, or making the right business choices.
Final Products
This course will equip you with the skills needed to create high-quality plots for printing, publishing and reporting, through in-class assignments and a capstone project.
In-Class Project 1:
In this in-class project you must clean raw data and use the cleaned data to create a population pyramid for Vietnam from 1995 to 2018.
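To give a feel for what a population pyramid involves, here is a minimal base R sketch. It is not the course's solution and does not use the project data: the age groups and counts are placeholder numbers invented for illustration, and the course itself may use a different plotting approach (for example ggplot2).

```r
# Placeholder numbers for illustration only (NOT real Vietnam data)
pop <- data.frame(
  age_group = c("0-14", "15-64", "65+"),
  male      = c(12, 33, 3),
  female    = c(11, 34, 4)
)

# Males are plotted to the left of zero, females to the right
pyramid <- rbind(Male = -pop$male, Female = pop$female)
colnames(pyramid) <- pop$age_group

ticks <- seq(-40, 40, by = 20)

# Left half: male bars, with age-group labels on the vertical axis
barplot(pyramid["Male", ], horiz = TRUE, xlim = range(ticks),
        names.arg = pop$age_group, las = 1, axes = FALSE,
        col = "steelblue", main = "Population pyramid (placeholder data)")

# Right half: female bars drawn on top of the same plot
barplot(pyramid["Female", ], horiz = TRUE, add = TRUE,
        axes = FALSE, axisnames = FALSE, col = "tomato")

# Show absolute values on the horizontal axis instead of negative numbers
axis(1, at = ticks, labels = abs(ticks))
legend("topright", legend = c("Male", "Female"),
       fill = c("steelblue", "tomato"), bty = "n")
```

The key idea is the sign flip: one group is stored as negative values so its bars extend to the left, while the axis labels are shown as absolute values.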
In-Class Project 2:
This in-class project requires you to replicate an Economist-style plot that shows the main insights about Vietnamese government spending on the public health care system, using raw data provided by the WHO.
In-Class Project 3:
This in-class project requires you to replicate a plot from The 2018 Atlas of Sustainable Development Goals by the World Bank.
Final Capstone Project:
This final capstone project requires you to adjust and improve a plot already published in the Vietnam PCI 2018 Report, using raw data provided by the Vietnam PCI Project.
Data Used
All data sets used in the course are real, so they reflect the situations you will actually face when analyzing data in the real world.
Software Used
R and RStudio will be used to perform all programming activities, assignments, and the final project. R 4.3.2 can be downloaded and installed for Windows or Mac from CRAN. RStudio is available in both Windows and Mac versions; select the one that matches your system, then download and install it.
Textbooks
All required classroom material will be provided in class or online. Any recommended but optional material will also be provided in the classroom. Here are some books you may find useful throughout the course. None needs to be purchased, and readings will be provided as PDFs as needed:
Garrett Grolemund & Hadley Wickham, R for Data Science.
Cole Nussbaumer Knaflic, Storytelling with Data: A Data Visualization Guide for Business Professionals.
Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis.
Nathan Yau, Data Points: Visualization That Means Something.
Nathan Yau, Visualize This: The FlowingData Guide to Design, Visualization, and Statistics.
Edward Tufte, The Visual Display of Quantitative Information.
Course Information
Instructor: Nguyễn Chí Dũng.
Language: Vietnamese is used for instruction and discussion. English is required for reading material/textbooks.
Approach: Hands-on, with a focus on case studies using real-world data.
Level: From Beginner to Intermediate.
Total Time: 32 hours over four days (03/11/2024, 10/11/2024, 17/11/2024, 24/11/2024), 8h30 - 12h30 for the morning session and 13h30 - 17h30 for the afternoon session.
Tuition fee: 4,000,000 VND.
Link for registration: https://docs.google.com/forms/d/e/1FAIpQLSfP7R-96bNBjFBUdaRpA4MNFJnnps4I4hu21_twUCa5PzQ8HQ/viewform (Ms.Nhung: 085 8781 628).
Location: 7A Tôn Thất Thiệp, Cửa Đông, Hoàn Kiếm, Hà Nội (DEPOCEN headquarters).