About this course


  • This course covers how to import, clean, transform, and visualize data, and how to communicate the results, using programming tools in R or Python.

  • We will also learn how to explore data to gain useful insights.

Software Preparation - Installing R and RStudio


  • Download the latest version of R from the Comprehensive R Archive Network (CRAN): https://cran.r-project.org.

  • RStudio is an integrated development environment (IDE) for R programming. You can think of it as a user-friendly graphical interface for using R. You can download and install it from http://www.rstudio.com/download.

  • If you are using macOS, make sure that you download the R and RStudio versions that are compatible with your macOS version.

A real example is better than ten thousand words


For this introductory course, we will go through real examples to show you a relatively comprehensive process of gaining insights from data, including

  • data collection

  • data cleaning

  • data transformation

  • data visualization

  • data modeling

  • data communication.

Data collection and cleaning typically take up 80%-90% of the time.


  • As we will see in this example.

Define your task


  • Now let’s imagine that our college is planning new online master’s programs in data science. A key question is what our sticker price per credit should be. You are given the task of researching the pricing of online data science programs.


Your task: Collect data about online master’s programs in data science, then write a report to your supervisor that recommends a pricing range and explains your reasoning.

First step: data collection


  • The first step is to collect data relevant to the question. We will touch briefly on web scraping, but it is not the focus of this course.


Lab Exercise: Try to find data that may be helpful to answer your question.

Data Collection - Example


  • Let’s do some simple data collection without any programming. US News offers free data on online master’s programs in the information technology specialty.



  • It seems that there is no “data science” concentration, so we have a few options here. To collect a sufficient amount of data, it may be a good idea to simply collect data from all concentrations in the IT field.

Data Cleaning


  • After we find a public data source, the next step is to “clean” the data.


  • In this example, we simply copy and paste all the raw text on the webpage. Obviously, the data is “messy” and not ready to use.


  • Our job is to keep the useful information from the text and arrange it in tabular form (clean data).


Lab Exercise: If you had to do this job without programming, how would you do it?

Data Cleaning - Example step I


  • With the development of generative AI tools such as ChatGPT and Claude, we no longer have to do this by ourselves for simple and small data sets.

  • (Shown in class) Use Claude to arrange raw text data into a table that can be copied into a csv file.

Data Cleaning - Step II


  • The data are still unclean and not ready for analysis yet. Can you see why?

Data Cleaning - Step II


  • We want to remove the text in the last two columns and keep only the numbers.
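
As a rough illustration of this cleaning step, here is a minimal Python sketch that pulls the number out of a text cell. The helper name `extract_number` and the sample cells are made up for illustration; they are not the actual US News data, and the real columns may need different rules.

```python
import re

def extract_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    # Drop thousands separators so that "1,250" is read as 1250.
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None

# Hypothetical raw cells, similar to text copied from a web page.
raw_price = "$1,250 per credit (out-of-state)"
raw_enrollment = "Enrollment: 312 students"
raw_missing = "N/A"

price = extract_number(raw_price)            # numeric price per credit
enrollment = extract_number(raw_enrollment)  # numeric enrollment
missing = extract_number(raw_missing)        # no digits, so None
```

Cells without any digits come back as `None`, which makes missing values easy to spot before the analysis.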

Data Import


  • Before we clean the data further with R, we need to import the data into RStudio.

  • We first need to create a “.csv” file and then import it into RStudio as a data frame (tabular data).

  • After doing that, we can now clean data with powerful built-in functions from R.

  • For this part, let’s not worry about technical details (which you will learn later), and simply focus on experiencing the process and steps in data analysis.
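
The class demo does the import in RStudio. For readers following along in Python, the same idea can be sketched with the standard `csv` module. The column names and rows below are hypothetical placeholders, and the text is read from a string only so that the example is self-contained; in practice it would be a file on disk.

```python
import csv
import io

# Hypothetical CSV contents; in practice this would be a saved ".csv" file.
csv_text = """school,rank,price_per_credit,enrollment
University A,1,900,250
University B,2,1200,400
University C,3,650,120
"""

def read_programs(handle):
    """Read CSV rows into a list of dicts and convert the numeric columns."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "school": row["school"],
            "rank": int(row["rank"]),
            "price_per_credit": float(row["price_per_credit"]),
            "enrollment": int(row["enrollment"]),
        })
    return rows

# Each element of programs is one row of the table.
programs = read_programs(io.StringIO(csv_text))
```

To read a real file instead, one would pass `open("programs.csv", newline="")` (a hypothetical file name) in place of the `io.StringIO` object.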

Data Transformation


  • After cleaning the data, we can now think of our analysis plan. In some cases, we may need to further work on data before making plots.

  • For example, now that we have the number of enrollments and the price per credit, we can create a new column named “revenue” (enrollment × price per credit × total credits) if we assume that every master’s program has around 36 credits in total (which is quite a reasonable assumption).

  • Creating new variables like this is called data transformation. As we see, it takes domain knowledge to do it in a meaningful way.
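
The revenue column described above can be sketched in a few lines of Python. The rows are hypothetical, and the 36-credit figure is the assumption stated in the text rather than a property of any particular program.

```python
CREDITS_PER_PROGRAM = 36  # assumption from the text: about 36 credits per program

def add_revenue(rows, credits=CREDITS_PER_PROGRAM):
    """Return new rows with revenue = enrollment * price per credit * credits."""
    return [
        dict(row, revenue=row["enrollment"] * row["price_per_credit"] * credits)
        for row in rows
    ]

# Hypothetical programs, not real US News data.
programs = [
    {"school": "University A", "price_per_credit": 900.0, "enrollment": 250},
    {"school": "University B", "price_per_credit": 1200.0, "enrollment": 400},
]

with_revenue = add_revenue(programs)
```

The function returns new rows instead of changing the originals, so the cleaned data stays available if the 36-credit assumption is revised later.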

Data Visualization


Now let’s analyze which factors may be related to revenue. We may plot the US News program rank against revenue.

Question: What can you learn from this graph?

Data Modeling


It seems that revenue has a negative relationship with rank. Let’s fit a linear model to the data.

Question: The fit does show a negative relationship. Do you think the trend is reliable?
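
To make the idea of fitting a linear model concrete, here is a minimal ordinary least squares sketch in plain Python. The ranks and revenues are invented toy numbers (revenue in millions), chosen only to show a negative slope with a weak fit; they are not the class data, and in R one would normally use `lm()` instead.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

def r_squared(x, y, slope, intercept):
    """Share of the variance in y that the fitted line explains."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical toy data: program rank and revenue in millions.
ranks = [1, 2, 3, 4, 5]
revenue = [10.0, 6.0, 9.0, 4.0, 6.0]

slope, intercept = fit_line(ranks, revenue)
r2 = r_squared(ranks, revenue, slope, intercept)
```

In this toy example the slope is negative, but the line explains well under half of the variation in revenue, which is one reason to be cautious about the trend.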

A Better visualization


We want to show the uncertainty of the model prediction as a confidence interval (CI). We can do this with a somewhat more advanced visualization.

Question: What do you think now about the relationship between estimated revenue and Rank?
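
The band drawn around a fitted line can be approximated by hand. The sketch below computes the standard error of the fitted mean at a given rank and returns fitted value ± 2 standard errors. The multiplier 2 is a rough stand-in assumed here for simplicity; an exact 95% interval would use a t quantile with n - 2 degrees of freedom, which is noticeably larger for small samples. The data are the same kind of invented toy numbers, not the class data.

```python
import math

def mean_prediction_band(x, y, x0, multiplier=2.0):
    """Approximate band for the fitted mean at x0: fitted value +/- multiplier * SE."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance with n - 2 degrees of freedom.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    residual_var = ss_res / (n - 2)
    fitted = intercept + slope * x0
    # The standard error grows as x0 moves away from the mean of x.
    se = math.sqrt(residual_var * (1 / n + (x0 - x_mean) ** 2 / sxx))
    return fitted - multiplier * se, fitted, fitted + multiplier * se

# Hypothetical toy data: program rank and revenue in millions.
ranks = [1, 2, 3, 4, 5]
revenue = [10.0, 6.0, 9.0, 4.0, 6.0]

low_mid, fit_mid, high_mid = mean_prediction_band(ranks, revenue, 3)
low_end, fit_end, high_end = mean_prediction_band(ranks, revenue, 5)
```

The band is narrowest near the average rank and widens toward the ends of the data. When it is wide enough that a flat line would fit inside it, the apparent trend is not well supported.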

Introduce the second data set


It seems that revenue does not have a strong relationship with the US News rank of online IT master’s programs. After some consideration, it may make more sense to analyze the effect of university reputation, which can be roughly measured by the US News ranking of universities. Now we need a second data set:

Data transformation - working on multiple data sets


Following the steps above, we can turn the second data set into a “csv” file as well. Now we have a new task: match the university names in the second data set to those in the first, and add the university rank to the first data set.


https://www.usnews.com/best-colleges/rankings/national-universities
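
Matching two data sets by university name can be sketched as a simple lookup. The names, ranks, and the `normalize` rule below are assumptions for illustration. Real school names often differ in more ways than case and spacing, so some manual checking is usually still needed.

```python
def normalize(name):
    """Lower-case a name and collapse repeated spaces so small differences still match."""
    return " ".join(name.lower().split())

def add_university_rank(programs, university_ranks):
    """Add a university_rank column to each program row; None when no match is found."""
    lookup = {normalize(name): rank for name, rank in university_ranks}
    return [
        dict(row, university_rank=lookup.get(normalize(row["school"])))
        for row in programs
    ]

# Hypothetical first data set (programs) and second data set (university ranks).
programs = [
    {"school": "University A", "revenue": 8.1},
    {"school": "university  b", "revenue": 17.3},
    {"school": "College Z", "revenue": 2.8},
]
university_ranks = [("University A", 12), ("University B", 160)]

merged = add_university_rank(programs, university_ranks)
```

Rows without a match keep a rank of `None` instead of being dropped, so it is easy to see which schools still need to be matched by hand.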

Visualize and model again


Now we can repeat the visualization and modeling process with the new variable.

We may also analyze the difference between reputable (US News rank 1~151) and non-reputable (US News rank 151+) schools
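
One simple way to compare the two groups is to average revenue within each. The sketch below assumes a cutoff of rank 150 between the groups; that exact value and the rows are illustrative assumptions, so adjust the cutoff to whatever definition you settle on.

```python
from statistics import mean

def mean_revenue_by_group(rows, cutoff):
    """Average revenue for schools ranked at or above the cutoff versus the rest."""
    groups = {"top": [], "other": []}
    for row in rows:
        rank = row["university_rank"]
        if rank is None:
            continue  # skip schools that could not be matched to a rank
        key = "top" if rank <= cutoff else "other"
        groups[key].append(row["revenue"])
    return {key: (mean(values) if values else None) for key, values in groups.items()}

# Hypothetical rows: university rank and revenue in millions.
rows = [
    {"university_rank": 12, "revenue": 10.0},
    {"university_rank": 80, "revenue": 6.0},
    {"university_rank": 160, "revenue": 4.0},
    {"university_rank": 200, "revenue": 2.0},
    {"university_rank": None, "revenue": 9.0},
]

summary = mean_revenue_by_group(rows, cutoff=150)  # cutoff is an assumed value
```

A difference in group averages is only a starting point. With few schools per group, it is worth plotting the individual points before drawing conclusions.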


Finally, wrap up your report in a presentation format or notebook/document format


  • We will learn R Markdown to conveniently write notebook or document reports that summarize your findings.


https://rmarkdown.rstudio.com/gallery.html

The Usefulness of Data Visualization



  • Visualisation is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions about the data.

  • A good visualisation might also hint that you’re asking the wrong question, or you need to collect different data.

  • Visualisations can surprise you, but don’t scale particularly well because they require a human to interpret them.

The Usefulness of Data Modeling



  • Models are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them.

  • Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains!

  • But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.

The EDA (Exploratory Data Analysis) Cycle


EDA is an iterative cycle. You:

  • Generate questions about your data.

  • Search for answers by visualising, transforming, and modelling your data.

  • Use what you learn to refine your questions and/or generate new questions. You may need to find new data to answer them.

Summary


  • Data Science is about problem-solving.

  • Statistical knowledge and programming skills are useful and efficient tools to help us resolve real-world challenges.

  • But as you see from the example, knowledge and skills need to be combined in practice with your common sense, domain knowledge and critical thinking.

  • In this course, we will learn and practice by working on real-world data.

  • We will focus on data import, cleaning, visualization and exploration, but will also touch on data collection and modeling to a limited extent.