Introduction to R


To better understand the objectives of this course, we will study an example. Before that, however, we need to warm up with some R basics.

  • R is a programming language for statistical computing and graphics. Unlike Python or C, it is not a general-purpose language. It is designed to do statistical analysis.

  • Therefore, R is really designed for anyone to use, even without any programming experience. This makes some features of R very different from other programming languages.

  • Once you know how to use R, you will find how powerful it is: you can do complicated graphing or analysis jobs with very little code.

Popularity of R


R and Python are the two most popular programming languages in Data Science. Although not as popular as Python, R is still among the top 10 popular programming languages in 2022.

R Basics - Operators in R


R provides the same kinds of operators as Python: arithmetic, relational, logical, and assignment operators. The exercises below walk through each group.

Exercise 1 - Arithmetic Operators


Try the following operations in the Console of RStudio and press Enter to see the result for each line:

2 + 3
2 - 3
2 * 3
2 / 3
2 ^ 3
2 %% 3
2 %/% 3

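Note that the last three operators are written differently from Python: ^ is exponentiation (** in Python), %% is the remainder (% in Python), and %/% is integer division (// in Python).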
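A minimal sketch of these three operators with other operands. The numbers are chosen only for illustration, and the variable names are not part of the exercise:

```r
# Exponentiation (Python: 2 ** 10)
power <- 2 ^ 10
power
## [1] 1024

# Remainder after division (Python: 7 % 3)
remainder <- 7 %% 3
remainder
## [1] 1

# Integer division, the fractional part is discarded (Python: 7 // 3)
quotient <- 7 %/% 3
quotient
## [1] 2
```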

Exercise 2 - Relational Operators


Try the following operations in the Console of RStudio and press Enter to see the result for each line:

2 == 3
2 != 3
2 > 3
2 < 3
2 >= 3
2 <= 3

Note: These operators are exactly the same as in Python - the result is a logical value.

The two logical values in R are TRUE and FALSE.

R is case sensitive


Like Python, R is case sensitive, so TRUE is not the same as True. Try the following operations in the RStudio Console:

TRUE
True
FALSE
False
T
F

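Note: In R, T is by default the same as TRUE (logically true) and F is the same as FALSE (logically false). However, T and F are ordinary variables that can be overwritten, while TRUE and FALSE are reserved words, so the full spellings are the safer choice.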
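To see why the full spellings TRUE and FALSE are safer than T and F, here is a small sketch. It temporarily overwrites T, which R allows because T is an ordinary variable, and then removes the overwritten copy again:

```r
# T is just a variable whose default value is TRUE, so it can be reassigned
T <- 0
isTRUE(T)     # T no longer means "true"
## [1] FALSE

# Remove our own T so that the built-in default is visible again
rm(T)
isTRUE(T)
## [1] TRUE

# By contrast, TRUE is a reserved word: running TRUE <- 0 is an error
```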

Exercise 3 - Logical Operators


Try the following operations in the Console of RStudio and press Enter to see the result for each line:

!TRUE
!FALSE
TRUE && TRUE
TRUE && FALSE
TRUE || FALSE
FALSE || FALSE

Note: we will discuss the difference between & and &&, and between | and ||, later.

Exercise 4 - Assignment Operators


In R, = also works as an assignment operator, but it is not recommended. In the R community, people generally prefer <- or -> as the assignment operator. Try the following operations, one line at a time:

a <- 3
a
4 -> a
a

Note: the arrow shows the direction of assignment. <- assigns the value on its right to the name on its left, and -> assigns the value on its left to the name on its right.

Self-Exercise


  • Compute the following expression \[\frac{(x^2 + y^3)(x^3+y^2)}{ \sqrt{5x - 2y}}\] for \(x = 1.73, y = 2.49\)

  • First evaluate the following logical expression without using R, then use R to check your answer:

x <- 1
y <- 2
!(x > y) || (x > y)

R Basics - Vectors


Vectors in R are like lists or tuples in Python, with one key difference: all elements of an R vector must be of the same data type (numbers, characters, logicals, etc.).

Exercise 5 - Create a vector


Try the following operations in the Console of RStudio and press Enter to see the result for each line:

c(1,3,5)
c('a', 1, 3)
c(FALSE, 1, 3)
1:10
seq(0, 10, by = 2)          # the "by" option controls the step
seq(0, 10, length.out = 4)  # the "length.out" option controls the total number of elements

Note: when a vector is given elements of several basic data types, R automatically converts all of them to a single data type.
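The conversion can be checked with typeof(). The sketch below reuses the two mixed vectors from the exercise; the variable names mixed_chr and mixed_num are only for illustration:

```r
# Mixing a character with numbers turns everything into character strings
mixed_chr <- c('a', 1, 3)
mixed_chr
## [1] "a" "1" "3"
typeof(mixed_chr)
## [1] "character"

# Mixing a logical with numbers turns FALSE into 0 (TRUE would become 1)
mixed_num <- c(FALSE, 1, 3)
mixed_num
## [1] 0 1 3
typeof(mixed_num)
## [1] "double"
```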

Exercise 6 - Create a vector


Create the following vector in three different ways in R:

1, 3, 5, 7, 9

R Basics - Selecting vector elements


Exercise 7 - Selecting vector elements


Try the following operations in the Console of RStudio and press Enter to see the result for each line:

my_vector <- c('a','p','p','l','e')
my_vector[1]
my_vector[-1]
my_vector[2:4]
my_vector[c(2,4)]

Note: In R, the first element has the index one. This is different from Python (which starts from zero).
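Negative indices are another difference from Python: in R they drop elements instead of counting from the end. A short sketch using the same vector (the name last_letter is only for illustration):

```r
my_vector <- c('a', 'p', 'p', 'l', 'e')

# A negative index removes that position
my_vector[-1]
## [1] "p" "p" "l" "e"

# Several positions can be removed at once
my_vector[-c(1, 2)]
## [1] "p" "l" "e"

# To get the last element, use the length of the vector as the index
last_letter <- my_vector[length(my_vector)]
last_letter
## [1] "e"
```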

Exercise 8 - Selecting vector elements


Try the following operations in the Console of RStudio and press Enter to see the result for each line:

x <- 1:10
x[x > 5]
x[x != 5]
x[x <= 5]
x[y > 5]

Note: one can put a logical expression to select the elements from a vector where the logical expression returns TRUE. But remember to use the same vector name in the logical expression.
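It may help to look at the logical expression on its own. In the sketch below, the intermediate variable keep is introduced only for illustration; it holds one TRUE or FALSE per element of x, and the brackets keep the elements where it is TRUE:

```r
x <- 1:10

# The comparison is applied to every element of x
keep <- x > 5
keep
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

# Indexing with the logical vector is the same as x[x > 5]
selected <- x[keep]
selected
## [1]  6  7  8  9 10
```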

Self-Exercise


The following code creates a randomly generated vector. Select all odd numbers in it.

my_vector <- sample(1:100, 25)

R Basics - Atomic Vectors


We have learned how to create vectors that contain multiple values of the same type with the c() function. These vectors are called atomic vectors, of which there are six types: logical, integer, double, character, complex, and raw. Integer and double vectors are collectively known as numeric vectors. In this course we won’t use complex and raw vectors, so we ignore them here.


We can use the typeof() function to check the type of a vector. Try the following code in your console:

typeof(c(1,2,3))
typeof(c(1L, 2L, 3L))
typeof(c("a", "b", "c"))
typeof(c(T, F, TRUE, FALSE))

Two Key Properties for Vectors


Every vector has two key properties:

  • Its type, which you can determine with typeof().
typeof(letters)
typeof(1:10)


  • Its length, which you can determine with length().
x <- seq(1, 20, by = 3)
length(x)


There’s one other related object: NULL. NULL is often used to represent the absence of a vector (as opposed to NA which is used to represent the absence of a value in a vector). NULL typically behaves like a vector of length 0.

length(NULL)
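The difference between NA and NULL shows up when they are placed inside a vector. A minimal sketch (the names with_na and with_null are only for illustration):

```r
# NA is a missing value, but it is still a value, so it has length 1
length(NA)
## [1] 1

# Inside a vector, NA occupies a position
with_na <- c(1, NA, 3)
length(with_na)
## [1] 3

# NULL contributes nothing, so the result has only two elements
with_null <- c(1, NULL, 3)
length(with_null)
## [1] 2
```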

R Data Structure Overview


To create a heterogeneous vector that contains values of different atomic types, we have to use lists, which are sometimes called recursive vectors because lists can contain other lists. The diagram below shows the hierarchy of R’s vector types.

R Basics - List


To create a “vector” mixed with different data types in R, we need to create a list using the list() function:

my_list <- list(name = c('James', 'Jane'), grade = c(88, 91))


You can assign a label to each element in the list; we will see how to use these labels on the next page.


Use the mode() function to check the basic data type (numeric, logical, character, list) of a variable.

mode(my_list)
## [1] "list"

See List Structure


Lists are a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You can view a list’s contents together with its structure by simply typing the list’s name:

my_list
## $name
## [1] "James" "Jane" 
## 
## $grade
## [1] 88 91


If we only want to see the structure, the str() function is convenient. It is particularly useful for large lists.

str(my_list)
## List of 2
##  $ name : chr [1:2] "James" "Jane"
##  $ grade: num [1:2] 88 91

Mixed-type List and Nested List


Unlike atomic vectors, lists can contain objects of mixed types.

y <- list("a", 1L, 1.5, TRUE)
str(y)
## List of 4
##  $ : chr "a"
##  $ : int 1
##  $ : num 1.5
##  $ : logi TRUE


Lists can also contain other lists (which is what makes them powerful), forming a nested list.

z <- list(list(1, 2), list(3, 4))
str(z)
## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ :List of 2
##   ..$ : num 3
##   ..$ : num 4

R Basics - Access Elements in a List


We can access an element of a list either by its index, using square brackets, or by its label, using the $ operator. In practice we usually use labels. Later we will see how lists are useful for storing the results of a statistical analysis.

my_list[1]
## $name
## [1] "James" "Jane"
my_list$grade
## [1] 88 91
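Note that my_list[1] above still prints the $name label: single brackets return a smaller list, not the element inside it. The sketch below contrasts this with double brackets, which extract the element itself (the name second_name is only for illustration):

```r
my_list <- list(name = c('James', 'Jane'), grade = c(88, 91))

# Single brackets: the result is still a list (with one element)
class(my_list[1])
## [1] "list"

# Double brackets: the result is the character vector itself
my_list[[1]]
## [1] "James" "Jane"

# Double brackets also accept a label, which gives the same result as $
my_list[["grade"]]
## [1] 88 91

# Once we have the vector, we can index into it as usual
second_name <- my_list$name[2]
second_name
## [1] "Jane"
```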

Self-Exercise


  • Make a list of three elements named “Name”, “Gender”, and “Grade” respectively which store information of 4 students. The first element “Name” should be a character vector of four names. The second element “Gender” should be a character vector of four gender codes (such as “F”, or “Female”). The last element should be a numeric vector of four grade scores. Give your list a name.

  • Subset your list to find the grade of the third student.

R Basics - Functions


R functions are very similar to Python functions. The syntax is function_name(positional_args, keyword_args = value, ...). In R, keyword arguments can also come before positional ones.
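As a small sketch of how arguments are matched, the three calls to seq() below produce the same sequence. The argument names from, to, by, and length.out belong to seq(); the variable names are only for illustration:

```r
# All arguments matched by position: from, to, by
by_position <- seq(0, 10, 2)

# All arguments matched by name, so the order does not matter
by_name <- seq(by = 2, from = 0, to = 10)

# Positional and named arguments mixed
mixed <- seq(0, 10, length.out = 6)

by_position
## [1]  0  2  4  6  8 10
```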

Below are some frequently used math functions.
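A few base R math functions in action. This particular selection is an assumption made for illustration and may not match the list originally shown on this page:

```r
root      <- sqrt(16)             # square root
magnitude <- abs(-3.5)            # absolute value
e_power   <- exp(0)               # e raised to the power 0
log_ten   <- log10(100)           # base-10 logarithm
rounded   <- round(3.14159, 2)    # round to 2 decimal places
total     <- sum(c(1, 2, 3, 4))   # sum of a vector
average   <- mean(c(1, 2, 3, 4))  # arithmetic mean of a vector

root
## [1] 4
rounded
## [1] 3.14
```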

R Basics - Help Documentation


In R, we can look up documentation either with the help() function or with ?function_name or ?data_name. Try the following commands in the RStudio console:

help(mean)
help('mean')
?mean
help('%in%')
?%in%


Note: For operators, a bare ? does not work (?%in% is a syntax error). Quote the operator instead, as in help('%in%') or ?'%in%'.


If you want to search for a function without knowing its name, use the “Help” tab in your RStudio interface.

Self-Exercise


  • Find the factorial function in R using the help documentation.

  • Find the inverse sine function in R using the help documentation.

R Basics - Data Frames


In this course, we will mostly handle rectangular data, or tabular data, which are simply data in spreadsheets with rows being samples and columns being features. Below is an example:

my_data <- data.frame(name = c('James', 'Alice', 'Lucy'), Math_grade = c(80, 90, 100), English_grade = c(100, 90, 80))
my_data
##    name Math_grade English_grade
## 1 James         80           100
## 2 Alice         90            90
## 3  Lucy        100            80


In this simple example, we have three students and their corresponding scores in two subjects. So each row is a sample (or observation), and each column is an attribute (usually called a feature or variable) of each sample.

Data Frames Are Special Lists


In R, data frames and tibbles (which we will learn about later) are built on top of lists. You can think of a data frame as a list with the following properties:


  • Each element must be an atomic vector with an assigned name

  • Each element must be of the same length
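These properties can be checked directly. The sketch below rebuilds my_data from the previous page and treats it as a list whose elements are the columns:

```r
my_data <- data.frame(name = c('James', 'Alice', 'Lucy'),
                      Math_grade = c(80, 90, 100),
                      English_grade = c(100, 90, 80))

# A data frame is a list underneath
is.list(my_data)
## [1] TRUE

# The list has one element per column, each with a name
length(my_data)
## [1] 3
names(my_data)

# Each element is a vector, and list-style access works
my_data[["Math_grade"]]
## [1]  80  90 100

# All elements have the same length (the number of rows)
column_lengths <- sapply(my_data, length)
column_lengths
```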

R Basics - Basic Operations of Data Frames


Exercise 9 - Basic Operations of Data Frames


Try the following commands in RStudio console:

View(my_data)
my_data$name
my_data$Math_grade
my_data$English_grade
my_data[1, ]
my_data[2, ]
my_data[3, 3]

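Note: Each column extracted with $ is a vector. A row selected with my_data[1, ] is itself a one-row data frame, because its values may have different types.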
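The kind of object returned by each command can be checked with class() and is.vector(). A minimal sketch, using the same my_data as above (the variable names are only for illustration):

```r
my_data <- data.frame(name = c('James', 'Alice', 'Lucy'),
                      Math_grade = c(80, 90, 100),
                      English_grade = c(100, 90, 80))

# A column extracted with $ is a plain vector
math_column <- my_data$Math_grade
is.vector(math_column)
## [1] TRUE

# A row keeps the data frame structure (one row, three columns)
first_row <- my_data[1, ]
class(first_row)
## [1] "data.frame"

# A single cell is just a value: row 3, column 3
one_cell <- my_data[3, 3]
one_cell
## [1] 80
```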

R Basics - install and load packages


In this course, we are going to analyze many data sets from the real world. As with import in Python, we need to load R packages to access additional data sets and functions.


One of the most popular R package collections for Data Science is tidyverse. It includes packages such as ggplot2, dplyr, forcats, readr, tidyr, and stringr to do different jobs in Data Science, including data importing, tidying, transformation, visualization, and others.


For more details, please refer to https://www.tidyverse.org/. To install the package, simply run the following command:

install.packages("tidyverse")

R Basics - install and load packages


You will not be able to use the functions, objects, and help files in a package until you load it with library(). Once you have installed a package, you can load it with the library() function:

library(tidyverse)


The message printed when the package loads tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages. These are considered to be the core of the tidyverse because you’ll use them in almost every analysis.


Packages in the tidyverse change fairly frequently. You can see if updates are available, and optionally install them, by running tidyverse_update().

Other packages


In this course we’ll use other data packages from outside the tidyverse. Let’s install some of them using the following command:

install.packages(c("nycflights13", "gapminder", "Lahman"))


After installing the packages, try loading each of them in the RStudio Console to make sure that they were installed correctly and are available to use.
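Besides calling library() on each package, one possible way to check several packages at once is base R's requireNamespace(), which returns TRUE when a package can be loaded and FALSE otherwise. This is only a sketch; the variable names pkgs and installed are for illustration:

```r
pkgs <- c("nycflights13", "gapminder", "Lahman")

# For each package name, try to load its namespace without attaching it.
# The result is a named logical vector: TRUE means the package is available.
installed <- sapply(pkgs, requireNamespace, quietly = TRUE)
installed

# Names of any packages that still need install.packages()
pkgs[!installed]
```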

Access a data set from packages


After we load the tidyverse package, many data sets become accessible, and we can access them simply by name. For example, today we will take a look at the data set named mpg.


mpg is a data set of fuel economy from 1999 to 2008 for 38 popular models of cars. More details about the data set can be found at https://ggplot2.tidyverse.org/reference/mpg.html.


We can also open the help document in R with the following command:

?mpg

Question: How many samples do we have in the data set? How many variables?

The mpg Data set


Now let’s have a complete view of the data set by using the command:

View(mpg)


Alternatively, you may use the following commands:

mpg
glimpse(mpg)

The good thing about glimpse(mpg) is that it will list all columns. If you only use mpg, when there are too many columns, some of them will be suppressed.
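The size of a data set can also be queried directly with base R functions. To keep the sketch self-contained, it uses a small made-up data frame (the name small_data and its values are hypothetical, not taken from mpg); with the tidyverse loaded, the same calls should work on mpg:

```r
# A tiny, made-up data frame standing in for a real data set
small_data <- data.frame(model = c('car_a', 'car_b', 'car_c', 'car_d'),
                         cty = c(18, 24, 15, 21),
                         hwy = c(29, 32, 23, 28))

nrow(small_data)   # number of rows (samples)
## [1] 4
ncol(small_data)   # number of columns (variables)
## [1] 3
dim(small_data)    # both at once: rows, then columns
## [1] 4 3

# str() is a base R function that, like glimpse(), lists every column
str(small_data)
```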

Example of Data Visualization and Exploration


In what follows, we won’t learn any new commands. Instead, I will show you what we will learn to do in this course, using mpg as an example.


Please don’t worry about how the following figures are generated; we will learn that starting in the next class.


Data exploration refers to the practice of finding useful patterns or trends in a given data set, and plotting graphs to visualize data is one of the most important ways to explore data.

Example of Data Visualization and Exploration


For example, let cty be the miles per gallon in the city and hwy be the miles per gallon on the highway. What does the following plot indicate? Why does it look like an increasing line?

Example of Data Visualization and Exploration


Now let’s explore the effect of drive train type (f = front-wheel drive, r = rear-wheel drive, 4 = 4-wheel drive) on fuel efficiency. What conclusion can you draw from the following graph? Which type is the most fuel-efficient?

Example of Data Visualization and Exploration


Now let’s explore the effect of number of cylinders on fuel efficiency. What conclusion can you draw from the following graph?

Summary


  • In this class, we learned R basics to prepare ourselves to visualize and explore data.

  • Usually, visualizing data is one of the best ways to preliminarily explore hidden insights in data.

  • To visualize data, sometimes we need to import, clean, and transform data first. This process is called data wrangling, which means “fighting with data”.

  • We will start learning how to visualize data in the next class.