About this course

  • This course covers how to import, clean, transform, and visualize data, and how to communicate the results, using R or Python.

  • We will also learn how to explore data to gain useful insights.

Software Preparation - Installing R and RStudio

  • Download the latest version of R from the Comprehensive R Archive Network, or CRAN (link: https://cran.r-project.org).

  • RStudio is an integrated development environment (IDE) for R programming. You can think of it as a user-friendly graphical interface for using R. You can download and install it from https://posit.co/download/rstudio-desktop/.

  • If you are using macOS, make sure that you download the R and RStudio versions that are compatible with your macOS version.

Introduction to R

To better understand the objectives of this course, we will study an example. Before that, however, we need to warm up with some R basics.

  • R is a programming language for statistical computing and graphics. Unlike Python or C, it was designed primarily for statistical analysis rather than as a general-purpose language.

  • Therefore, R is designed to be usable by anyone, even without prior programming experience. This makes some features of R very different from other programming languages.

  • Once you know how to use R, you will find how powerful it is: you can do complicated graphing or analysis jobs with very short code.

Popularity of R

R and Python are the two most popular programming languages in data science. Although not as popular as Python, R was still among the top 10 most popular programming languages in 2022.

R Basics - Operators in R

R provides the familiar groups of operators found in Python: arithmetic, relational, logical, and assignment. The exercises below walk through each group.

Lab Exercise - Arithmetic Operators

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

2 + 3
2 - 3
2 * 3
2 / 3
2 ^ 3
2 %% 3
2 %/% 3

Note that the last three operators are written differently in Python, which uses ** for exponentiation, % for the remainder, and // for integer division.
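As a small supplementary sketch (the numbers 7 and 3 are chosen only for illustration), the following lines show what the last three operators compute, with the Python spelling noted in the comments:

```r
# Exponentiation: Python writes this as 7 ** 3
power <- 7 ^ 3

# Remainder after division: Python writes this as 7 % 3
remainder <- 7 %% 3

# Integer (floor) division: Python writes this as 7 // 3
quotient <- 7 %/% 3

c(power, remainder, quotient)
## [1] 343   1   2

# The results fit together: quotient * divisor + remainder gives back 7
quotient * 3 + remainder
## [1] 7
```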

Lab Exercise - Relational Operators

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

2 == 3
2 != 3
2 > 3
2 < 3
2 >= 3
2 <= 3

Note: These operators are exactly the same as in Python - the result is a logical value.

The two logical values in R are TRUE and FALSE.

R is case sensitive

Like Python, R is case sensitive, so TRUE is not the same as True. Try the following operations in the RStudio Console:

TRUE
True
FALSE
False
T
F

Note: In R, T and F are built-in shorthands for TRUE and FALSE. However, T and F are ordinary variables that can be reassigned, whereas TRUE and FALSE are reserved words, so writing TRUE and FALSE in full is safer.
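The following sketch illustrates why spelling out TRUE and FALSE is the safer habit. It deliberately reassigns T, which you should not do in real code, and then removes the reassigned copy again:

```r
# T starts out equal to TRUE
identical(T, TRUE)
## [1] TRUE

# Unlike TRUE, the name T can be given a new value
T <- 0
t_after <- identical(T, TRUE)
t_after
## [1] FALSE

# Remove our own T so that the built-in value becomes visible again
rm(T)
restored <- identical(T, TRUE)
restored
## [1] TRUE
```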

Lab Exercise - Logical Operators

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

!TRUE
!FALSE
TRUE & TRUE
TRUE & FALSE
TRUE | FALSE
FALSE | FALSE

Note: we will discuss the difference between & and &&, and between | and ||, later.

Lab Exercise - Assignment Operators

In R, = also works as an assignment operator, but its use is generally discouraged. The R community generally prefers <- or -> for assignment. Try the following operations line by line:

a <- 3
a
4 -> a
a

Note: the arrow shows the direction of assignment. With <-, the value on the right is assigned to the name on the left; with ->, the value on the left is assigned to the name on the right.

R Basics - Vectors

Vectors in R are like lists or tuples in Python, with one key difference: all elements of a vector in R must be of the same data type (numeric, character, logical, etc.).

Lab Exercise - Create a vector

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

c(1,3,5)
c('a', 1, 3)
c(FALSE, 1, 3)
1:10
seq(0, 10, by = 2)          # the "by" option controls the step
seq(0, 10, length.out = 4)  # the "length.out" option controls the total number of elements

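Note: when a vector is created from values of several basic data types, R automatically converts all elements to a single type. Logical values become numbers when mixed with numbers, and everything becomes a character string when mixed with characters.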
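To see this conversion directly, the sketch below reuses the vectors from the exercise and checks their types with typeof(), a function that is introduced properly later in these notes:

```r
# Logical mixed with numbers: FALSE is converted to 0
mixed_num <- c(FALSE, 1, 3)
mixed_num
## [1] 0 1 3
typeof(mixed_num)
## [1] "double"

# Anything mixed with a character: every element becomes a string
mixed_chr <- c('a', 1, 3)
mixed_chr
## [1] "a" "1" "3"
typeof(mixed_chr)
## [1] "character"
```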

Lab Exercise - Create a vector

Create the following vector in three different ways in R:

1, 3, 5, 7, 9

R Basics - Selecting vector elements

Lab Exercise - Selecting vector elements

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

my_vector <- c('a','p','p','l','e')
my_vector[1]
my_vector[-1]
my_vector[2:4]
my_vector[c(2,4)]

Note: In R, the first element has index 1. This is different from Python, where indexing starts from 0.
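Negative indices are another point where R and Python differ. As a short sketch using the same vector: in R a negative index drops an element, whereas in Python an index of -1 selects the last element.

```r
my_vector <- c('a', 'p', 'p', 'l', 'e')

# Index 1 is the first element
first <- my_vector[1]
first
## [1] "a"

# A negative index removes that position instead of counting from the end
without_first <- my_vector[-1]
without_first
## [1] "p" "p" "l" "e"

# One way to get the last element is to use the length of the vector
last <- my_vector[length(my_vector)]
last
## [1] "e"
```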

Lab Exercise - Selecting vector elements

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

x <- 1:10
x[x > 5]
x[x != 5]
x[x <= 5]
x[y > 5]   # fails unless an object named y exists; the condition should refer to x

Note: you can put a logical expression inside the brackets to select the elements for which the expression is TRUE. Remember to refer to the same vector inside the logical expression.
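It can help to look at the logical expression on its own before using it as an index. The sketch below stores it in a variable (the name keep is only illustrative) and then combines two conditions with &:

```r
x <- 1:10

# The comparison is evaluated for every element and returns a logical vector
keep <- x > 5
keep
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

# Indexing with that vector keeps the positions that are TRUE
x[keep]
## [1]  6  7  8  9 10

# Conditions can be combined: greater than 5 AND even
big_even <- x[x > 5 & x %% 2 == 0]
big_even
## [1]  6  8 10
```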

Review of R basics


  • Basic arithmetic operators in R: +, -, *, /, ^, %%, %/%

  • Basic assignment operators in R: ->, <-

  • Basic atomic data types in R: integer, double, character and logical

  • Basic logical values and operators: TRUE, FALSE, ==, !=, >=, <=, >, <, &&, ||, &, |

  • Create atomic vectors in R: c(), seq()

  • Help documentation: use ? or help()

  • Use functions in R: positional arguments and keyword arguments.

R Basics - Atomic Vectors


We have learned how to create vectors that contain multiple values of the same type with the c() function. These vectors are called atomic vectors, of which there are six types: logical, integer, double, character, complex, and raw. Integer and double vectors are collectively known as numeric vectors. In this course we won’t use complex and raw vectors, so we ignore them here.


We can use the typeof() function to check the type of a vector. Try the following code in your console.

typeof(c(1,2,3))
typeof(c(1L, 2L, 3L))
typeof(c("a", "b", "c"))
typeof(c(T, F, TRUE, FALSE))

Two Key Properties for Vectors


Every vector has two key properties:

  • Its type, which you can determine with typeof().
typeof(letters)
typeof(1:10)


  • Its length, which you can determine with length().
x <- seq(1, 20, by = 3)
length(x)


There’s one other related object: NULL. NULL is often used to represent the absence of a vector (as opposed to NA which is used to represent the absence of a value in a vector). NULL typically behaves like a vector of length 0.

length(NULL)
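
The difference between NULL and NA shows up when they are placed inside c(). As a brief sketch: NA occupies a position in the vector, while NULL contributes nothing.

```r
# NA is a missing value, but it still counts as one element
length(NA)
## [1] 1
with_na <- c(1, NA, 3)
length(with_na)
## [1] 3

# NULL is the absence of a vector, so it disappears when combined
with_null <- c(1, NULL, 3)
length(with_null)
## [1] 2
```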

Names for Vector Elements

In a vector, each element can be given a label. For example:

grades <- c(James = 79, Jane = 85, Jack = 91)
grades
## James  Jane  Jack 
##    79    85    91

Here the values in grades are the numbers 79, 85, and 91, and each value is labeled with a student name. The labels of a vector form another vector, of type character. We can retrieve them with the function names():

names(grades)
## [1] "James" "Jane"  "Jack"

Its type is character.

student_names <- names(grades)
typeof(student_names)
## [1] "character"
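
Labels are also useful for selecting elements. As a short sketch built on the grades vector above, a name can be used inside the brackets in place of a position:

```r
grades <- c(James = 79, Jane = 85, Jack = 91)

# Select one element by its label; the label is kept in the result
grades["Jane"]
## Jane 
##   85

# unname() drops the label and leaves only the value
jane_grade <- unname(grades["Jane"])
jane_grade
## [1] 85

# Combine names() with a logical condition: who scored above 80?
above_80 <- names(grades)[grades > 80]
above_80
## [1] "Jane" "Jack"
```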

Functions basics in R


To use a function, we need to understand the following basic concepts:

  • Argument: an input of a function

  • Required Argument: an argument that must be supplied; if it is missing, the function stops with an error message

  • Optional Argument: an argument that may be omitted; if it is missing, a default value is used

  • Positional Argument: an argument that is matched by its position, so it has to be given in a specific order

  • Keyword Argument: an argument that is specified by its name (usually optional, with a default value)
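
As a quick illustration of positional and keyword arguments, the sketch below calls seq() from the earlier exercises in several ways. The argument names from, to, and by come from the help page of seq():

```r
# Positional: the values are matched to from, to, by in that order
by_position <- seq(0, 10, 2)
by_position
## [1]  0  2  4  6  8 10

# Keyword: each value is named, so the order no longer matters
by_keyword <- seq(by = 2, to = 10, from = 0)
identical(by_position, by_keyword)
## [1] TRUE

# Optional argument left out: by falls back to its default step of 1
default_step <- seq(0, 3)
default_step
## [1] 0 1 2 3
```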

Example

my_data <- c(1, 2, 2, 5, 10, NA)
mean(my_data, trim = 0, na.rm = TRUE)
## [1] 4

In the example above, we call the function mean() to compute the average of the data. Let’s check the help documentation of the function mean():

help("mean")

So we see that the function has a few arguments:

  • x - the input data: this is the first argument; it is required and has no default value.

  • trim - the fraction of observations to be trimmed from each end of x before the mean is computed: this is the second argument, with a default value of 0.

  • na.rm - a logical value (TRUE or FALSE) indicating whether NA values are removed before the mean is computed: this is the third argument, with a default value of FALSE.

Lab Exercise


Try the following code and see what each line returns. Can you explain each result?

mean(trim = 0, na.rm = T)
mean(my_data)
mean(my_data, trim = 0.2)
mean(my_data, trim = 0.2, na.rum = T)

R Basics - List


To create a “vector” that mixes different data types in R, we need to create a list using the list() function:

my_list <- list(student_name = c('James', 'Jane', 'Jack'), student_grade = c(79, 85, 91))


Here the first element in the list is a character vector, with the label student_name. The second element in the list is a double vector, with the label student_grade. To retrieve an element of a list, we usually use the $ operator along with the label.

my_list$student_name
## [1] "James" "Jane"  "Jack"
my_list$student_grade
## [1] 79 85 91
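
The sketch below looks at the same list with functions already covered (length(), names(), typeof()) and shows the double-bracket form [[ ]], which is an alternative to $ for retrieving an element by its label:

```r
my_list <- list(student_name = c('James', 'Jane', 'Jack'),
                student_grade = c(79, 85, 91))

# A list has a length and names, just like an atomic vector
length(my_list)
## [1] 2
names(my_list)
## [1] "student_name"  "student_grade"

# Each element keeps its own type
typeof(my_list$student_name)
## [1] "character"
typeof(my_list$student_grade)
## [1] "double"

# [[ ]] with the label retrieves the same element as $
same_element <- identical(my_list[["student_grade"]], my_list$student_grade)
same_element
## [1] TRUE

# The retrieved element is an ordinary vector, so functions like mean() work
avg_grade <- mean(my_list$student_grade)
avg_grade
## [1] 85
```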

Data Frames


In this course, we will mostly handle rectangular (or tabular) data, which is simply data arranged as in a spreadsheet, with rows being samples and columns being features. Here is an example:

my_data <- data.frame(name = c('James', 'Alice', 'Lucy'), Math_grade = c(80, 90, 100), English_grade = c(100, 90, 80))
my_data
##    name Math_grade English_grade
## 1 James         80           100
## 2 Alice         90            90
## 3  Lucy        100            80


In this simple example, we have three students and their corresponding scores in two subjects. So each row is a sample (or observation), and each column is an attribute (usually called features or variables) for each sample.

Data Frames Are Special Lists


In R, data frames and tibbles (which we will learn about later) are built on top of lists. You can think of a data frame as a list in which each element is a column, and all columns have the same length.


Again, we can use the two basic functions length() and names() on a data frame.

length(my_data)
## [1] 3
names(my_data)
## [1] "name"          "Math_grade"    "English_grade"

Note that here the length() function returns the number of columns (features). To get the number of rows, we need the function nrow():

nrow(my_data)
## [1] 3
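
The list nature of a data frame can be checked directly. This sketch rebuilds the same my_data and applies list tools to it; dim() is an additional base R function that returns rows and columns together:

```r
my_data <- data.frame(name = c('James', 'Alice', 'Lucy'),
                      Math_grade = c(80, 90, 100),
                      English_grade = c(100, 90, 80))

# Underneath, a data frame is stored as a list of columns
typeof(my_data)
## [1] "list"
is.list(my_data)
## [1] TRUE

# So list-style access with [[ ]] returns a whole column as a vector
math <- my_data[["Math_grade"]]
math
## [1]  80  90 100

# dim() reports the number of rows first, then the number of columns
dim(my_data)
## [1] 3 3
```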

R Basics - Basic Operations of Data Frames


Lab Exercise - Basic Operations of Data Frames


Try the following commands in the RStudio console. Before you execute each line, think about what you expect the result to be:

View(my_data)
my_data$name
my_data$Math_grade
my_data$English_grade
my_data[1, ]
my_data[2, ]
my_data[3, 3]

R Basics - install and load packages


In this course, we are going to analyze many real-world data sets. Just as we use import in Python, in R we need to load packages to access additional data sets and functions.


One of the most popular collections of R packages for data science is tidyverse. It includes packages such as ggplot2, dplyr, forcats, readr, tidyr, and stringr, which handle different data science tasks, including data import, tidying, transformation, and visualization.


For more details, please refer to https://www.tidyverse.org/. To install tidyverse, simply run the following command:

install.packages("tidyverse")

Once you have installed a package, you can load it with the library() function. You will not be able to use the functions, objects, and help files in a package until you load it:

library(tidyverse)


When tidyverse loads, it prints a message listing the packages it attaches, including ggplot2, tibble, tidyr, readr, purrr, and dplyr. These are considered to be the core of the tidyverse because you’ll use them in almost every analysis.


Packages in the tidyverse change fairly frequently. You can see if updates are available, and optionally install them, by running tidyverse_update().
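
Since installing is needed only once but loading is needed in every new session, it can be handy to check whether a package is already installed. The following is a minimal sketch using the base R function requireNamespace(); the variable names pkg and installed are only illustrative, and the code prints a reminder rather than installing anything by itself:

```r
pkg <- "tidyverse"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  message(pkg, " is installed; load it with library(", pkg, ")")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg, "\") first")
}
```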

Access a data set from packages


After we load tidyverse, many data sets become accessible, and we can access each one simply by its name. For example, today we will take a look at the data set named mpg.


mpg is a data set of fuel economy from 1999 to 2008 for 38 popular models of cars. More details about the data set can be found at https://ggplot2.tidyverse.org/reference/mpg.html.


We can also open the help page for the data set in R with the following command:

?mpg

The mpg Data set


Now let’s have a complete view of the data set by using the command:

View(mpg)


Or you may use the following commands as well

mpg
glimpse(mpg)

The good thing about glimpse(mpg) is that it lists all columns. If you only type mpg, some columns are hidden when there are too many to fit on the screen.

A glimpse of mpg


glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…


Lab Questions:


  1. How many columns (features/variables) are there? How many rows (samples) are there?

  2. What is the meaning of each variable?

  3. What is the average mpg in city for all car models in the data set?

Example of Data Visualization and Exploration


Next, we won’t learn any new commands. Instead, I will show you what we will learn to do in this course, using mpg as an example.


Please don’t worry about how the following figures are generated; we will learn that starting in the next class.


Data exploration refers to the practice of finding useful patterns or trends in a given data set, and plotting graphs to visualize data is one of the most important ways to explore data.

Example of Data Visualization and Exploration


For example, cty is the miles per gallon in the city and hwy is the miles per gallon on the highway. What does the following plot indicate? Why does it look like an increasing line?

Example of Data Visualization and Exploration


Now let’s explore the effect of drive train type (f = front-wheel drive, r = rear-wheel drive, 4 = 4-wheel drive) on fuel efficiency. What conclusion can you draw from the following graph? Which type is the most fuel-efficient?

Example of Data Visualization and Exploration


Now let’s explore the effect of number of cylinders on fuel efficiency. What conclusion can you draw from the following graph?

Basic Concepts in Statistics


When we use graphs to summarize and visualize data, we are doing descriptive statistics. To understand what it is, we need to go through some basic concepts of statistics.

Let’s start by looking at a statement of a kind commonly seen in the news media:

A poll of 1,247 voters shows that 48% of US voters support the Republican candidate, with a 3% margin of error.


Clearly, this is a statement based on statistics. Understanding it requires a few key concepts, which are introduced next.
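
As a rough numerical reading of the poll statement, the sketch below computes the usual large-sample margin of error for a proportion. It assumes a simple random sample and a 95% confidence level; neither is stated in the poll, so treat the result as an approximation rather than the pollster's actual method.

```r
n     <- 1247   # number of voters polled (the sample size)
p_hat <- 0.48   # share of the sample supporting the candidate

# Assumption: 95% confidence, so we use the 97.5% quantile of the normal distribution
z <- qnorm(0.975)

# Approximate margin of error for a sample proportion
moe <- z * sqrt(p_hat * (1 - p_hat) / n)
round(100 * moe, 1)
## [1] 2.8

# Using the reported 3% margin, the plausible range for the population share
reported_range <- c(lower = p_hat - 0.03, upper = p_hat + 0.03)
reported_range   # 0.45 to 0.51
```

Under these assumptions the computed value, about 2.8%, is consistent with the reported 3% margin of error.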

Population and Sample


Lab Exercise


Statement: “A study of the survival of 1,225 newly diagnosed breast cancer cases finds that the average seven-year survival rate for Stage I breast cancer was 92%.”

seven-year survival rate: the percentage of patients who survive seven years after diagnosis.

Descriptive vs Inferential Statistics



Lab Exercises


A study shows that 71.6% of US adults are overweight. Answer the following question:

Types of random variables


Lab Exercise


Give a real example of different variable types:

Introduction to data plots


Before we start to plot graphs, we need to review the basic types of data plots. There are many of them, and we will look at a few of the most commonly used ones.

Plot types depend on data types


Why do we have this many plot types? One reason is that we need different plots to best illustrate the relationship between (usually one or two) variables of different types.

Example of a bar plot


Next, we will use the mpg data set to give an example of each plot type and explain its meaning.

A bar plot shows the distribution of one categorical variable.

ggplot(mpg) + 
  geom_bar(aes(x = drv))

Example of a histogram


A histogram shows the distribution of one numeric variable (discrete or continuous).

ggplot(mpg) + 
  geom_histogram(aes(x = hwy), binwidth = 5)  # each bin covers 5 mpg

Example of a box plot


A box plot shows a five-number summary of a numeric variable (discrete or continuous).

ggplot(mpg) + 
  geom_boxplot(aes(x = hwy))
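
To make the five-number summary concrete, the sketch below applies the base R function fivenum() to the first nine hwy values shown in the glimpse output earlier (not the full data set). Note that geom_boxplot() draws its whiskers only up to 1.5 times the box height and plots points beyond that separately, so the ends of the whiskers do not always match the minimum and maximum.

```r
# First nine hwy values from the glimpse(mpg) output
hwy_sample <- c(29, 29, 31, 30, 26, 26, 27, 26, 25)

# Minimum, lower quartile (hinge), median, upper quartile (hinge), maximum
five <- fivenum(hwy_sample)
five
## [1] 25 26 27 29 31
```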

Example of a scatter plot


A scatter plot usually shows the relationship between two numeric variables.

ggplot(mpg) + 
  geom_point(aes(x = hwy, y = cty))

Example of a multiple box plot


A multiple box plot usually shows the relationship between one categorical variable and one numeric variable.

ggplot(mpg) + 
  geom_boxplot(aes(x = drv, y = cty))

Example of a stacked bar plot


A stacked bar plot usually shows the relationship between two categorical variables.

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = drv, fill = class))

Lab Homework


  1. Finish all lab exercises. Submit your R code.

  2. Compute \[ \left( \frac{2206 \sqrt{2}}{99^2} \right)^{-1} \]. Submit your answer as a number rounded to three decimal places.

  3. What is the result of the following code? (Don’t execute the code; answer by inspection.)

x <- 1
2 <- y
(x == y) | (x < y)

  4. Create a vector containing 1, 3, 5, 7, 9 in three different ways. Give your code.

  5. Using the help documentation, find the factorial function and the inverse sine function in R. Submit the function names and give an example of each.

  6. Create a data frame with the first column being the years from 1999 to 2002, and the second column being the day of the week (Monday, Tuesday, etc.) of January 1st of that year. Submit your code and show the result.