This course covers how to import, clean, transform, and visualize data, and how to communicate the resulting insights, using programming tools in the R language or Python.
We will also learn how to explore data to gain useful insights.
Download the latest version of R from the Comprehensive R Archive Network, or CRAN (link: https://cran.r-project.org).
RStudio is an integrated development environment (IDE) for R programming. You can think of it as a user-friendly graphical interface for working with R. You can download and install it from http://www.rstudio.com/download.
If you are using macOS, make sure that you download the R/RStudio versions that are compatible with your macOS version.
For this introductory course, we will go through real examples to show you a relatively comprehensive process of gaining insights from data, including
data collection
data cleaning
data transformation
data visualization
data modeling
data communication.
We will see these steps in the following example.
It seems that there is no “data science” concentration, so we have a few options here. To collect a sufficient amount of data, it may be a good idea to simply collect data from all concentrations in the IT field.
With the development of generative AI tools such as ChatGPT and Claude, we no longer have to do this by ourselves for simple and small data sets.
(Shown in class) Use Claude to arrange the raw text data into a table that can be copied into a CSV file.
Before we clean the data further with R, we need to import the data into RStudio.
We first create a “.csv” file and then import it into RStudio as a data frame (tabular data).
After doing that, we can clean the data with R's powerful built-in functions.
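As a concrete illustration, here is a minimal sketch of the import and some light cleaning in base R. The file name ("programs.csv") and the column names (program, enrollment, price_per_credit) are assumptions for illustration; replace them with the names in your own file.

# Import the CSV file into a data frame
programs <- read.csv("programs.csv", stringsAsFactors = FALSE)

# Take a quick look at the structure and the first few rows
str(programs)
head(programs)

# Light cleaning: trim stray whitespace and make sure numeric columns are numeric
programs$program <- trimws(programs$program)
programs$enrollment <- as.numeric(programs$enrollment)
programs$price_per_credit <- as.numeric(programs$price_per_credit)

# Drop rows that are missing enrollment or price information
programs <- programs[!is.na(programs$enrollment) & !is.na(programs$price_per_credit), ]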
For this part, let’s not worry about technical details (which you will learn later), and simply focus on experiencing the process and steps in data analysis.
After cleaning the data, we can now think about our analysis plan. In some cases, we may need to do further work on the data before making plots.
For example, now that we have data on the number of enrollments and the price of each credit, we can create a new column named “revenue”, if we assume that every master’s program has around 36 credits in total (a quite reasonable assumption).
As we can see, domain knowledge is needed to take such meaningful actions when creating new variables; this is called data transformation.
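A minimal sketch of this transformation, assuming the data frame and column names from the import sketch above:

# Estimated revenue = enrollment x price per credit x 36 total credits
programs$revenue <- programs$enrollment * programs$price_per_credit * 36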
Now let’s analyze which factors may be related to the revenue. We can plot the program rank from US News against the revenue.
Question: What can you learn from this graph?
It seems that the revenue has a negative relationship with the rank. Let’s fit the data to a linear model.
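A sketch of this plot and fit, assuming the program rank is stored in a column named rank:

# Scatterplot of estimated revenue against program rank
plot(programs$rank, programs$revenue,
     xlab = "US News program rank", ylab = "Estimated revenue")

# Fit a simple linear model: revenue as a function of rank
fit <- lm(revenue ~ rank, data = programs)
summary(fit)

# Add the fitted line to the scatterplot
abline(fit)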
Question: The fit does show a negative relationship. Do you think the trend is reliable?
We want to show the uncertainty (confidence interval) of the model prediction. We can do this with slightly more advanced visualization.
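One convenient option is ggplot2's geom_smooth(), which draws the fitted line together with a shaded confidence band (again assuming the rank and revenue columns above):

library(ggplot2)

# Fitted line with its confidence band
ggplot(programs, aes(x = rank, y = revenue)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(x = "US News program rank", y = "Estimated revenue")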
Question: What do you think now about the relationship between estimated revenue and Rank?
It seems that the revenue does not have a strong relationship with the US News rank of online IT master’s programs. After some consideration, it may make more sense to analyze the effect of university reputation, which can be roughly measured by the US News ranking of universities. Now we need a second data set:
https://www.usnews.com/best-colleges/rankings/national-universities
Following the steps above, we can turn the second data set into a “.csv” file as well. Now we have a new task: check the university names from the second data set and add the corresponding university ranks to the first data set.
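A sketch of this step with dplyr, where the file name ("university_rank.csv") and its columns (university, univ_rank) are assumptions for illustration, and the first data set is assumed to have a matching university column:

library(dplyr)

# Import the second data set
universities <- read.csv("university_rank.csv", stringsAsFactors = FALSE)

# Check which university names in the first data set have no match
setdiff(programs$university, universities$university)

# Add the university rank to the program data by matching names
programs <- left_join(programs, universities, by = "university")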
Now we can repeat the visualization and modeling process with the new variable.
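For example, repeating the plot and the linear fit with the university rank (the univ_rank column added above):

library(ggplot2)

# Estimated revenue against university rank, with a fitted line and confidence band
ggplot(programs, aes(x = univ_rank, y = revenue)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(x = "US News university rank", y = "Estimated revenue")

# Refit the linear model with the new variable
fit2 <- lm(revenue ~ univ_rank, data = programs)
summary(fit2)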
Visualisation is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions about the data.
A good visualisation might also hint that you’re asking the wrong question, or you need to collect different data.
Visualisations can surprise you, but don’t scale particularly well because they require a human to interpret them.
Models are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them.
Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains!
But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.
EDA is an iterative cycle. You:
Generate questions about your data.
Search for answers by visualising, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions. You may need to find new data to answer them.
Data Science is about problem-solving.
Statistical knowledge and programming skills are useful and efficient tools that help us solve real-world challenges.
But as you can see from the example, knowledge and skills need to be put into practice with common sense, domain knowledge, and critical thinking.
In this course, we will learn and exercise by working on real-world data.
We will focus on data import, cleaning, visualization, and exploration, but will also touch on data collection and modeling to a limited extent.