About this course

This course covers how to import, clean, transform, and visualize data, and how to communicate the results, using programming tools in R or Python.
We will also learn how to explore data to gain useful insights.

Software Preparation - Installing R and RStudio

Download the latest version of R from the Comprehensive R Archive Network, or CRAN (link: https://cran.r-project.org).
RStudio is an integrated development environment (IDE) for R programming. You can think of it as a user-friendly graphical interface for using R. You can download and install it from https://posit.co/download/rstudio-desktop/.
If you are using macOS, make sure that you download the R and RStudio versions that are compatible with your macOS version.

Introduction to R

To understand the objectives of this course better, we will study an example. Before that, however, we need to warm up with some R basics.

R is a programming language for statistical computing and graphics. Unlike Python or C, it is not a general-purpose language; it is designed for statistical analysis.
R is therefore meant to be usable by anyone, even without programming experience, which makes some of its features very different from those of other programming languages.
Once you know how to use R, you will find how powerful it is: you can produce complicated graphs or analyses with very little code.

Popularity of R

R and Python are the two most popular programming languages in data science. Although not as popular as Python, R was still among the ten most popular programming languages in 2022.

R Basics - Operators in R

R provides the same kinds of operators as Python. The lab exercises below go through them briefly:

Lab Exercise - Arithmetic Operators

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

2 + 3
2 - 3
2 * 3
2 / 3
2 ^ 3
2 %% 3
2 %/% 3

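Note that the last three operators are written differently in Python: exponentiation is ** (^ in R), the remainder is % (%% in R), and integer division is // (%/% in R).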
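
As a small illustration of the last three operators, the sketch below uses the numbers 7 and 2 or 3 (chosen here for illustration, they are not part of the exercise) and notes the Python spelling of each operator in a comment.

```r
# Exponentiation: R uses ^ (Python uses **)
power <- 7 ^ 2          # 49

# Remainder of a division: R uses %% (Python uses %)
remainder <- 7 %% 3     # 1

# Integer division: R uses %/% (Python uses //)
quotient <- 7 %/% 3     # 2

# The quotient and the remainder together rebuild the original number
3 * quotient + remainder == 7   # TRUE
```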

Lab Exercise - Relational Operators

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

2 == 3
2 != 3
2 > 3
2 < 3
2 >= 3
2 <= 3

Note: These operators are exactly the same as in Python - the result is a logical value.

The two logical values in R are TRUE and FALSE.

R is case sensitive

Like Python, R is case sensitive, so TRUE is not the same as True. Try the following in the RStudio Console:

TRUE
True
FALSE
False
T
F

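Note: In R, T is the same as TRUE, which represents being logically true, and F is the same as FALSE, which represents being logically false. TRUE and FALSE are reserved words, whereas T and F are ordinary variables that can be overwritten, so spelling out TRUE and FALSE is the safer habit.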
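
The relationship between T and TRUE can be checked directly. The following is a minimal sketch; the variable names overwritten and restored are made up for this illustration.

```r
# TRUE is a reserved word, so R refuses to overwrite it.
# (The line below is commented out because it stops with an error.)
# TRUE <- 0

# T is an ordinary variable whose default value is TRUE ...
identical(T, TRUE)            # TRUE

# ... so it can be overwritten, by accident or on purpose.
T <- 0
overwritten <- isTRUE(T)      # FALSE: T now holds the number 0

# Remove our own copy of T so that T means TRUE again.
rm(T)
restored <- identical(T, TRUE)   # TRUE
```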

Lab Exercise - Logical Operators

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

!TRUE
!FALSE
TRUE & TRUE
TRUE & FALSE
TRUE | FALSE
FALSE | FALSE

Note: we will discuss the difference between & and &&, and between | and ||, later.

Lab Exercise - Assignment Operators

In R, = also works as an assignment operator, but using it is not recommended. In the R community, people generally prefer <- or -> as the assignment operator. Try the following operations line by line:

a <- 3
a
4 -> a
a

Note: the arrow shows the direction of the assignment: <- assigns the value on its right to the name on its left, and -> assigns the value on its left to the name on its right.
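
The three spellings of assignment can be compared side by side. The sketch below also shows one practical difference between = and <- inside a function call; the names b, c_val and n_copies, and the choice of rep() as the example function, are assumptions made for this illustration.

```r
# Three ways to store the value 3 under a name
a <- 3        # right to left (the usual style)
3 -> b        # left to right
c_val = 3     # also works, but is less common in R code

a == b        # TRUE
b == c_val    # TRUE

# Inside a function call, = names an argument and does NOT create a variable ...
rep("ab", times = 3)
exists("times", inherits = FALSE)   # FALSE, assuming times was not defined earlier

# ... whereas <- creates the variable and then passes its value on.
rep("ab", n_copies <- 3)
n_copies                            # 3
```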

R Basics - Vectors

Vectors in R are like lists or tuples in Python, with one key difference: all elements of a vector in R must be of the same data type (numeric, character, logical, etc.).

Lab Exercise - Create a vector

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

c(1,3,5)
c('a', 1, 3)
c(FALSE, 1, 3)
1:10
seq(0, 10, by = 2)          # the "by" option controls the step
seq(0, 10, length.out = 4)  # the "length.out" option controls the total number of elements

Note: when the elements passed to c() have different basic data types, R automatically converts all of them to a single data type (logical values become numbers, and numbers become character strings).
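
The automatic conversion can be made visible with typeof(), which is introduced later in these notes. This is a small sketch; the names mixed_chr and mixed_num are made up for the illustration.

```r
# Mixing characters and numbers: everything becomes character
mixed_chr <- c('a', 1, 3)
mixed_chr             # "a" "1" "3"
typeof(mixed_chr)     # "character"

# Mixing logicals and numbers: FALSE becomes 0 (and TRUE would become 1)
mixed_num <- c(FALSE, 1, 3)
mixed_num             # 0 1 3
typeof(mixed_num)     # "double"

# The conversion can be useful: summing a logical vector counts the TRUE values
sum(c(TRUE, FALSE, TRUE))   # 2
```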

Lab Exercise - Create a vector

Create the following vector in three different ways in R:

1, 3, 5, 7, 9

R Basics - Selecting vector elements

Lab Exercise - Selecting vector elements

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

my_vector <- c('a', 'p', 'p', 'l', 'e')
my_vector[1]
my_vector[-1]
my_vector[2:4]
my_vector[c(2,4)]

Note: In R, the first element has the index one. This is different from Python (which starts from zero).
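
A second difference from Python is worth a short sketch: a negative index in R drops the element at that position, whereas in Python it counts from the end. The example reuses the vector from the exercise; the extra index combinations are added here for illustration.

```r
my_vector <- c('a', 'p', 'p', 'l', 'e')

my_vector[1]                    # "a": the first element (index 1, not 0)
my_vector[length(my_vector)]    # "e": the last element
my_vector[-1]                   # "p" "p" "l" "e": everything except the first
my_vector[c(-1, -5)]            # "p" "p" "l": drop the first and the fifth
my_vector[c(5, 1)]              # "e" "a": positions can be given in any order
```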

Lab Exercise - Selecting vector elements

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

x <- 1:10
x[x > 5]
x[x != 5]
x[x <= 5]
x[y > 5]

Note: a logical expression inside the square brackets selects the elements for which the expression is TRUE. Remember to use the name of the same vector in the logical expression; the last line refers to y, which has not been defined in these exercises.
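
To see why this works, it helps to look at the logical expression on its own. The sketch below stores it in a variable called keep (a name chosen for this illustration) and then combines two conditions with the element-wise operators & and |.

```r
x <- 1:10

# The comparison is evaluated first and gives one TRUE/FALSE per element
keep <- x > 5
keep        # FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE

# Using that logical vector as an index keeps the TRUE positions
x[keep]     # 6 7 8 9 10

# Conditions can be combined with & (and) or | (or)
x[x > 3 & x < 8]     # 4 5 6 7
x[x < 3 | x > 8]     # 1 2 9 10
```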

Review of R basics

Basic algebraic operators in R: +, -, *, /, ^, %%, %/%
Basic assignment operators in R: ->, <-
Basic atomic data types in R: integer, double, character and logical
Basic logical values, relational operators and logical operators: TRUE, FALSE, ==, !=, >=, <=, >, <, !, &&, ||, &, |
Create atomic vectors in R: c(), seq()
Help documentation: use ? or help()
Use functions in R: positional arguments and keyword arguments.

R Basics - Atomic Vectors

We have learned how to create vectors that contain multiple values of the same type with the c() function. These vectors are called atomic vectors, of which there are six types: logical, integer, double, character, complex, and raw. Integer and double vectors are collectively known as numeric vectors. We won’t use complex or raw vectors in this course, so we ignore them here.

We can use the typeof() function to check the type of a vector. Try the following code in your console:

typeof(c(1,2,3))
typeof(c(1L, 2L, 3L))
typeof(c("a", "b", "c"))
typeof(c(T, F, TRUE, FALSE))
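
The statement that integer and double vectors are both numeric can be checked with is.numeric(). This is a minimal sketch; the names int_vec and dbl_vec are made up for the illustration.

```r
int_vec <- c(1L, 2L, 3L)   # the L suffix asks for integers
dbl_vec <- c(1, 2, 3)      # plain numbers are stored as doubles

typeof(int_vec)       # "integer"
typeof(dbl_vec)       # "double"

# Both count as numeric vectors ...
is.numeric(int_vec)   # TRUE
is.numeric(dbl_vec)   # TRUE

# ... while character and logical vectors do not
is.numeric(c("a", "b"))       # FALSE
is.numeric(c(TRUE, FALSE))    # FALSE
```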

Two Key Properties for Vectors

Every vector has two key properties:

Its type, which you can determine with typeof().

typeof(letters)
typeof(1:10)

Its length, which you can determine with length().

x <- seq(1, 20, by = 3)
length(x)

There’s one other related object: NULL. NULL is often used to represent the absence of a vector (as opposed to NA which is used to represent the absence of a value in a vector). NULL typically behaves like a vector of length 0.

length(NULL)
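
The difference between NA and NULL shows up as soon as both are placed inside c(). The sketch below uses two made-up vectors, with_na and with_null, to illustrate it.

```r
# NA is a placeholder for a missing value, so it still takes up a slot
with_na <- c(1, NA, 3)
length(with_na)       # 3
is.na(with_na)        # FALSE TRUE FALSE

# NULL stands for "no vector at all", so c() simply drops it
with_null <- c(1, NULL, 3)
length(with_null)     # 2

is.null(NULL)         # TRUE
is.null(NA)           # FALSE
```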

Names for Vector Elements

In a vector, each element can be assigned a label, for example:

grades <- c(James = 79, Jane = 85, Jack = 91)
grades

## James  Jane  Jack 
##    79    85    91

Here the values in grades are the numbers 79, 85, and 91, but each value is now labelled with a student name. The labels of a vector form another vector, of type character. We can retrieve them with the names() function:

names(grades)

## [1] "James" "Jane"  "Jack"

Its type is character.

student_names <- names(grades)
typeof(student_names)

## [1] "character"
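
Labels are not only for display: they can also be used to select elements. The following sketch reuses grades from above; the ages vector and its values are made up for the illustration.

```r
grades <- c(James = 79, Jane = 85, Jack = 91)

# Select elements by label instead of by position
grades["Jane"]                 # a named vector holding Jane's grade, 85
grades[c("Jack", "James")]     # Jack's grade (91), then James's (79)

# Labels can also be attached after the vector has been created
ages <- c(20, 22, 21)          # made-up values for illustration
names(ages) <- c("James", "Jane", "Jack")
names(ages)                    # "James" "Jane" "Jack"

# unname() returns the values without their labels
unname(grades)                 # 79 85 91
```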

Functions basics in R

To use a function, we need to understand the following basic concepts:

Argument: an input of a function
Required Argument: an argument that must be supplied; if it is missing, the function stops with an error message
Optional Argument: an argument that does not have to be supplied; if it is missing, a default value is used
Positional Argument: an argument that is matched by its position, so it has to be supplied in a specific order
Keyword Argument: an argument that is specified by its name (usually optional, with a default value)
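
These concepts can be tried out with seq(), which was used earlier to create vectors. The sketch below calls it in three equivalent ways; the result names by_position, by_keyword and mixed are made up for the illustration.

```r
# Positional: the values are matched to from, to and by in that order
by_position <- seq(0, 10, 2)

# Keyword: each value is matched by name, so the order no longer matters
by_keyword <- seq(by = 2, to = 10, from = 0)

# Mixed: the first two by position, the third by keyword
mixed <- seq(0, 10, by = 2)

identical(by_position, by_keyword)   # TRUE
identical(by_position, mixed)        # TRUE

# Optional arguments can be left out: without by, the step is 1
seq(0, 3)                            # 0 1 2 3
```

Passing the first one or two arguments by position and the rest by keyword, as in the last of the three calls, is a common compromise between brevity and readability.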

Example

my_data <- c(1, 2, 2, 5, 10, NA)
mean(my_data, trim = 0, na.rm = TRUE)

## [1] 4

In the example above, we call the function mean() to compute the average of the data. Let’s check the help documentation of the function mean():

help("mean")

We see that the function has a few arguments:

x - the input data: this is the first argument; it is required and has no default value.
trim - the fraction of observations to be trimmed from each end of x before the mean is computed: this is the second argument, with a default value of 0.
na.rm - a logical value (TRUE or FALSE) indicating whether NA values are removed before the mean is computed: this is the third argument, with a default value of FALSE.
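
To get a feeling for what trim and na.rm do, here is a small sketch on a different, made-up data set (scores), so that it does not give away the lab exercise that follows.

```r
scores <- c(1, 2, 3, 4, 100)     # made-up data with one extreme value

# Default: trim = 0, so every value is used
plain_mean <- mean(scores)                    # 22

# trim = 0.2 drops 20% of the values (here one value) from each end
# after sorting, leaving 2, 3 and 4
trimmed_mean <- mean(scores, trim = 0.2)      # 3

# A single NA makes the result NA unless na.rm = TRUE
scores_na <- c(scores, NA)
mean(scores_na)                               # NA
mean_no_na <- mean(scores_na, na.rm = TRUE)   # 22
```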

Lab Exercise

Try the following code and see what it returns. Can you explain each result?

mean(trim = 0, na.rm = T)
mean(my_data)
mean(my_data, trim = 0.2)
mean(my_data, trim = 0.2, na.rum = T)

R Basics - List

To create a “vector” that mixes different data types in R, we need to create a list using the list() function:

my_list <- list(student_name = c('James', 'Jane', 'Jack'), student_grade = c(79, 85, 91))

Here the first element in the list is a character vector, with the label student_name. The second element in the list is a double vector, with the label student_grade. To retrieve an element of a list, we usually use the $ operator along with the label.

my_list$student_name

## [1] "James" "Jane"  "Jack"

my_list$student_grade

## [1] 79 85 91
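
Besides $, a list element can also be extracted with double square brackets. The following sketch reuses my_list from above and shows that each element keeps its own type; nothing in it goes beyond base R.

```r
my_list <- list(student_name = c('James', 'Jane', 'Jack'),
                student_grade = c(79, 85, 91))

# A list has its own type and one element per component
typeof(my_list)                  # "list"
length(my_list)                  # 2
names(my_list)                   # "student_name" "student_grade"

# [[ ]] extracts a component by label or by position, just like $
my_list[["student_grade"]]       # 79 85 91
my_list[[1]]                     # "James" "Jane" "Jack"

# Each component keeps its own type, so it can be used directly
typeof(my_list$student_name)     # "character"
mean(my_list$student_grade)      # 85
```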

Lecture 1 - Introduction to R and Review of Descriptive Statistics

Miao Yu

2025-01-15
