About this course

This course covers how to import, clean, transform, and visualize data, and how to communicate the results, using programming tools in R or Python.
We will also learn how to explore data to gain useful insights.

Software Preparation - Installing R and RStudio

Download the latest version of R from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org.
RStudio is an integrated development environment (IDE) for R programming. You can think of it as a user-friendly graphical interface for using R. You can download and install it from http://www.rstudio.com/download.
If you are using macOS, make sure that you download the R and RStudio versions that are compatible with your macOS version.

Introduction to R

To better understand the objectives of this course, we will study an example. Before that, however, we need to warm up with some R basics.

R is a programming language for statistical computing and graphics. Unlike Python or C, it is not a general-purpose language; it is designed for statistical analysis.
As a result, R is meant to be usable by anyone, even without prior programming experience, which makes some features of R very different from other programming languages.
Once you know how to use R, you will find how powerful it is: you can do complicated graphing or analysis jobs with very little code.

Popularity of R

R and Python are the two most popular programming languages in data science. Although not as popular as Python, R was still among the top 10 most popular programming languages in 2022.

R Basics - Operators in R

R has counterparts for the common operators in Python. The exercises below go through a brief list.

Exercise 1 - Arithmetic Operators

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

2 + 3
2 - 3
2 * 3
2 / 3
2 ^ 3
2 %% 3
2 %/% 3

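Note that the last three operators are written differently in Python: exponentiation (^) is ** in Python, the remainder operator (%%) is %, and integer division (%/%) is //.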
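As a quick check of these three operators, the sketch below uses the operands 7 and 3 (chosen here for illustration, not taken from the exercise) so that the remainder and the integer quotient are easy to tell apart.

```r
# Operands chosen for illustration only
power     <- 2 ^ 3    # exponentiation (Python: 2 ** 3)
remainder <- 7 %% 3   # remainder after division (Python: 7 % 3)
quotient  <- 7 %/% 3  # integer division (Python: 7 // 3)

print(c(power, remainder, quotient))

# The two results fit together: quotient * 3 + remainder gives back 7
print(quotient * 3 + remainder)
```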

Exercise 2 - Relational Operators

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

2 == 3
2 != 3
2 > 3
2 < 3
2 >= 3
2 <= 3

Note: These operators are exactly the same as in Python - the result is a logical value.

The two logical values in R are TRUE and FALSE.

R is case sensitive

Like Python, R is case sensitive. Therefore, TRUE is not the same as True. Try the following operations in the RStudio Console:

TRUE
True
FALSE
False
T
F

Note: In R, T is shorthand for TRUE (logically true) and F is shorthand for FALSE (logically false). True and False are not defined, so typing them produces an error.
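One caveat worth knowing: TRUE and FALSE are reserved words, whereas T and F are ordinary variables that merely start out equal to them. The sketch below (the name shadowed is ours) reassigns T inside a local() block so that the change does not leak into your session.

```r
# T is a regular variable, so it can be overwritten
shadowed <- local({
  T <- FALSE   # legal: inside this block T no longer means TRUE
  T
})
print(shadowed)

# Outside the block, T still has its default value
print(identical(T, TRUE))
```

For this reason, spelling out TRUE and FALSE is the safer habit in scripts.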

Exercise 3 - Logical Operators

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

!TRUE
!FALSE
TRUE & TRUE
TRUE & FALSE
TRUE | FALSE
FALSE | FALSE

Note: we will discuss the difference between & and &&, and between | and ||, later.

Exercise 4 - Assignment Operators

In R, = also works as an assignment operator, but it is not recommended. In the R community, people generally prefer <- or -> as the assignment operator. Try the following operations line by line:

a <- 3
a
4 -> a
a

Note: the arrow shows the direction of assignment. <- assigns the value on its right to the name on its left, and -> assigns the value on its left to the name on its right.

R Basics - Vectors

Vectors in R are like lists or tuples in Python, with one key difference: all elements of an R vector must be of the same data type (numeric, character, logical, etc.).

Exercise 5 - Create a vector

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

c(1,3,5)
c('a', 1, 3)
c(FALSE, 1, 3)
1:10
seq(0, 10, by = 2)          # the "by" option controls the step
seq(0, 10, length.out = 4)  # the "length.out" option controls the total number of elements

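Note: when the elements of a vector have different basic data types, R automatically converts all of them to a single data type. Logical values become numbers when mixed with numbers, and everything becomes a character string when mixed with characters.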
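To see the conversion at work, the sketch below inspects two of the vectors from Exercise 5 with class(), which reports the type that R settled on. The variable names are ours.

```r
with_character <- c('a', 1, 3)    # the numbers become character strings
with_logical   <- c(FALSE, 1, 3)  # FALSE becomes the number 0

print(class(c(1, 3, 5)))      # "numeric"
print(class(with_character))  # "character"
print(class(with_logical))    # "numeric"

print(with_character)
print(with_logical)
```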

Exercise 6 - Create a vector

Create the following vector in three different ways in R:

1, 3, 5, 7, 9

R Basics - Selecting vector elements

Exercise 7 - Selecting vector elements

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

my_vector <- c('a','p','p','l','e')
my_vector[1]
my_vector[-1]
my_vector[2:4]
my_vector[c(2,4)]

Note: In R, the first element has index 1. This is different from Python, where indexing starts from 0.
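A negative index also behaves differently from Python: in R it drops the element at that position rather than counting from the end. A minimal sketch, with variable names of our own:

```r
my_vector <- c('a', 'p', 'p', 'l', 'e')

first_element <- my_vector[1]                  # indexing starts at 1
without_first <- my_vector[-1]                 # everything except element 1
last_element  <- my_vector[length(my_vector)]  # the last element

print(first_element)
print(without_first)
print(last_element)
```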

Exercise 8 - Selecting vector elements

Try the following operations in the Console of RStudio and press Enter to see the result for each line:

x <- 1:10
x[x > 5]
x[x != 5]
x[x <= 5]
x[y > 5]

Note: a logical expression inside the brackets selects the elements for which the expression is TRUE. The expression must refer to an existing vector, normally the one being subset. The last line fails because no object named y has been defined.
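It can help to look at the logical expression on its own before using it as an index. In this sketch, is_large is a name introduced for illustration.

```r
x <- 1:10

is_large <- x > 5     # one TRUE or FALSE per element of x
print(is_large)

print(x[is_large])    # same result as x[x > 5]
print(sum(is_large))  # TRUE counts as 1, so this counts the selected elements
```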

R Basics - List

To create a “vector” that mixes different data types in R, we need to create a list using the list() function:

my_list <- list(name = c('James', 'Jane'), grade = c(88, 91))

You can assign a label to each element in the list. The next section shows how to use these labels.

Use the mode() function to check the basic data type (numeric, logical, character, list) of a variable.

mode(my_list)

## [1] "list"

R Basics - Access Elements in a List

We can access an element of a list either by its index, using square brackets, or by its label, using the $ operator. In practice we usually use labels. Later we will see how lists are useful for storing the results of a statistical analysis.

my_list[1]

## $name
## [1] "James" "Jane"

my_list$grade

## [1] 88 91
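Note that single brackets return a smaller list, while double brackets or $ return the element itself. The sketch below rebuilds my_list and compares the three forms; sub_list and names_vector are names introduced here.

```r
my_list <- list(name = c('James', 'Jane'), grade = c(88, 91))

sub_list     <- my_list[1]    # still a list, containing only "name"
names_vector <- my_list[[1]]  # the character vector itself

print(mode(sub_list))      # "list"
print(mode(names_vector))  # "character"

# $ and [[ ]] with a label return the same element
print(identical(my_list$grade, my_list[['grade']]))
print(mean(my_list$grade))
```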

R Basics - Functions

R functions are very similar to Python functions. The syntax is function_name(positional_args, keyword_arg = value, ...). Unlike in Python, keyword (named) arguments in R may appear before positional ones.
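The calling rules can be sketched with a small function. Here power_of is a made-up function written only for this illustration; it is not part of R.

```r
# power_of is a hypothetical helper, defined here for illustration
power_of <- function(base, exponent = 2) {
  base ^ exponent
}

print(power_of(3))                       # exponent falls back to its default
print(power_of(3, exponent = 3))         # positional argument, then a named one
print(power_of(exponent = 3, 3))         # named argument first; 3 is still matched to base
print(power_of(exponent = 3, base = 2))  # all named, in any order
```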

Below are some frequently used math functions.
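As an illustration, here are a few math functions from base R. The selection and the input values are ours and are not meant to be exhaustive.

```r
print(sqrt(16))             # square root
print(abs(-3))              # absolute value
print(exp(1))               # e raised to the power 1
print(log(100, base = 10))  # logarithm; the natural log is used when base is omitted
print(round(3.14159, 2))    # round to 2 decimal places
print(floor(2.7))           # round down
print(ceiling(2.1))         # round up
print(sum(1:10))            # sum of the elements of a vector
print(mean(c(1, 2, 3)))     # arithmetic mean
```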

R Basics - Help Documentation

In R, we can look up documentation with the help() function, or simply with ?function_name or ?data_name. Try the following commands in the RStudio Console.

help(mean)
help('mean')
?mean
help('%in%')
?%in%

Note: For operators, a bare ? does not work, as the last line shows. Use help() with the operator in quotes, or put the operator in quotes after ?, as in ?'%in%'.

R Basics - Data Frames

In this course, we will mostly handle rectangular data, or tabular data, which are simply data in spreadsheets, with rows being samples and columns being features. Below is an example:

my_data <- data.frame(name = c('James', 'Alice', 'Lucy'), Math_grade = c(80, 90, 100), English_grade = c(100, 90, 80))
my_data

##    name Math_grade English_grade
## 1 James         80           100
## 2 Alice         90            90
## 3  Lucy        100            80

In this simple example, we have three students and their corresponding scores in two subjects. So each row is a sample (or observation), and each column is an attribute of the samples (usually called a feature or variable).
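The counts of samples and variables can be read off directly with a few base R functions. The sketch below rebuilds my_data so that it runs on its own.

```r
my_data <- data.frame(
  name          = c('James', 'Alice', 'Lucy'),
  Math_grade    = c(80, 90, 100),
  English_grade = c(100, 90, 80)
)

print(nrow(my_data))   # number of rows (samples)
print(ncol(my_data))   # number of columns (variables)
print(names(my_data))  # column names
str(my_data)           # compact overview: one line per column with its type
```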

R Basics - Basic Operations of Data Frames

Exercise 9 - Basic Operations of Data Frames

Try the following commands in RStudio console:

View(my_data)
my_data$name
my_data$Math_grade
my_data$English_grade
my_data[1, ]
my_data[2, ]
my_data[3, 3]

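Note: Each column of a data frame is a vector. A single row, such as my_data[1, ], is returned as a one-row data frame, because its columns may hold different data types.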
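One refinement: a column extracted with $ is a plain vector, whereas a row extracted with [i, ] comes back as a one-row data frame. The sketch below checks this on my_data; the names math and first_row are ours.

```r
my_data <- data.frame(
  name          = c('James', 'Alice', 'Lucy'),
  Math_grade    = c(80, 90, 100),
  English_grade = c(100, 90, 80)
)

math      <- my_data$Math_grade  # a column
first_row <- my_data[1, ]        # a row

print(is.vector(math))           # the column is a vector
print(is.data.frame(first_row))  # the row is still a data frame
print(my_data[3, 3])             # row 3, column 3: Lucy's English grade
print(mean(math))                # vector functions work directly on a column
```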

R Basics - install and load packages

In this course, we are going to analyze many data sets from the real world. Just as we use import in Python, we need to load R packages to access additional data sets and functions.

One of the most popular collections of R packages for data science is the tidyverse. It includes packages such as ggplot2, dplyr, forcats, readr, tidyr, and stringr, which handle different jobs in data science, including data importing, tidying, transformation, and visualization.

For more details, please refer to https://www.tidyverse.org/. To install the package, simply run the following command:

install.packages("tidyverse")

R Basics - install and load packages

You will not be able to use the functions, objects, and help files in a package until you load it with library(). Once you have installed a package, you can load it with the library() function:

library(tidyverse)

The message printed on loading tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages. These are considered to be the core of the tidyverse because you’ll use them in almost every analysis.

Packages in the tidyverse change fairly frequently. You can see if updates are available, and optionally install them, by running tidyverse_update().

Other packages

In this course we’ll also use data packages from outside the tidyverse. Let’s install them all together using the following command:

install.packages(c("nycflights13", "gapminder", "Lahman"))

After installing a package, try loading it in the RStudio Console to make sure that it was installed correctly and is available to use.
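As an alternative to loading each package by hand, the sketch below checks all three in one go with requireNamespace(), which returns TRUE or FALSE instead of stopping with an error. The names pkgs and installed are ours.

```r
pkgs <- c("nycflights13", "gapminder", "Lahman")

# TRUE for each package that R can find, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Names of any packages that still need to be installed
print(pkgs[!installed])
```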

Access a data set from packages

After we load the tidyverse package, many data sets become accessible, and we can access each of them simply by name. For example, today we will take a look at the data set named mpg.

mpg is a data set of fuel economy from 1999 to 2008 for 38 popular models of cars. More details about the data set can be found at https://ggplot2.tidyverse.org/reference/mpg.html.

We can also open the same help document from within R with the following command.

?mpg

Question: How many samples do we have in the data set? How many variables?

The `mpg` Data set

Now let’s have a complete view of the data set by using the command:

View(mpg)

Alternatively, you may use the following commands:

mpg
glimpse(mpg)

The good thing about glimpse(mpg) is that it lists all columns. If you only type mpg, some columns are hidden when there are too many to fit on the screen.

Example of Data Visualization and Exploration

Next, we won’t learn any new commands. Instead, I will show you what we will learn to do in this course, using mpg as an example.

Please don’t worry about how the following figures are generated; we will learn that starting in the next class.

Data exploration refers to the practice of finding useful patterns or trends in a given data set, and plotting graphs to visualize data is one of the most important ways to explore data.

Example of Data Visualization and Exploration

For example, cty is the miles per gallon in the city, and hwy is the miles per gallon on the highway. What does the following plot indicate? Why does it look like an increasing line?

Example of Data Visualization and Exploration

Now let’s explore the effect of drive train type (f = front-wheel drive, r = rear-wheel drive, 4 = 4-wheel drive) on fuel efficiency. What conclusion can you draw from the following graph? Which type is the most fuel-efficient?

Example of Data Visualization and Exploration

Now let’s explore the effect of number of cylinders on fuel efficiency. What conclusion can you draw from the following graph?

Summary

In this class, we learned R basics to prepare ourselves to visualize and explore data.
Visualizing data is usually one of the best ways to take a first look for hidden insights in data.
To visualize data, we sometimes need to import, clean, and transform it first. This process is called data wrangling, which means “fighting with data”.
We will start learning how to visualize data in the next class.

Lecture 1 - Introduction

Miao Yu

2023-02-01
