To better understand the objectives of this course, we will study an example. Before that, however, we need to warm up with some R basics.
R is a programming language for statistical computing and graphics. Unlike Python or C, it is not a general-purpose language. It is designed to do statistical analysis.
As a result, R is designed to be usable by anyone, even without any programming experience. This makes some features of R very different from other programming languages.
Once you know how to use R, you will find how powerful it is: you can do complicated graphing or analysis jobs with very little code.
R and Python are the two most popular programming languages in data science. Although not as popular as Python, R was still among the top 10 most popular programming languages in 2022.
R has counterparts to all of the basic operators in Python. We start with the arithmetic operators.
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
2 + 3
2 - 3
2 * 3
2 / 3
2 ^ 3
2 %% 3
2 %/% 3
Note that the last three operators differ from Python, which uses **, %, and // for exponentiation, remainder, and integer division.
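To see how %/% and %% fit together, here is a minimal sketch. The variable names a, b, quotient, and remainder are chosen only for illustration. It checks that the integer quotient and the remainder rebuild the original number, and shows how R treats a negative left operand.

```r
# Hypothetical values chosen for illustration
a <- 17
b <- 5

quotient  <- a %/% b   # integer division: 3
remainder <- a %% b    # remainder: 2

# The two results always rebuild the original number
quotient * b + remainder == a   # TRUE

# With a negative left operand, R rounds the quotient down,
# so the remainder takes the sign of the divisor
(-7) %/% 3   # -3
(-7) %% 3    # 2
```

The parentheses around -7 are there only to make the order of evaluation explicit.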
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
2 == 3
2 != 3
2 > 3
2 < 3
2 >= 3
2 <= 3
Note: These operators are exactly the same as in Python - the result is a logical value.
The two logical values in R are TRUE and FALSE.
Like Python, R is case sensitive. Therefore TRUE is not the same as True. Try the following operations in the RStudio Console:
TRUE
True
FALSE
False
T
F
Note: In R, T is shorthand for TRUE (logically true) and F is shorthand for FALSE (logically false). Unlike TRUE and FALSE, however, T and F are ordinary variables that can be overwritten, so the full names are safer in scripts.
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
!TRUE
!FALSE
TRUE && TRUE
TRUE && FALSE
TRUE || FALSE
FALSE || FALSE
Note: we will discuss the difference between & and &&, and between | and ||, later.
In R, = also works as an assignment operator. However, using it is not recommended. In the R community, people generally prefer <- or -> as the assignment operator. Try the following operations, one line at a time:
a <- 3
a
4 -> a
a
Note: the arrow shows the direction of assignment. <- assigns the value on its right to the name on its left, and -> assigns the value on its left to the name on its right.
Compute the following expression \[\frac{(x^2 + y^3)(x^3+y^2)}{ \sqrt{5x - 2y}}\] for \(x = 1.73, y = 2.49\)
First evaluate the following logical expression without using R, then use R to check your answer:
x <- 1
y <- 2
!(x > y) || (x > y)
Vectors in R are like lists or tuples in Python, with one key difference: all elements of a vector in R must be of the same data type (numeric, character, logical, etc.).
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
c(1,3,5)
c('a', 1, 3)
c(FALSE, 1, 3)
1:10
seq(0, 10, by = 2) # the "by" option controls the step
seq(0, 10, length.out = 4) # the "length.out" option controls the total number of elements
Note: when a vector is given elements of several basic data types, R automatically converts all of them to a single data type.
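The conversion follows a fixed order: logical values become numbers, and numbers become character strings, whenever a "wider" type is present. A short sketch using typeof() to make the result visible:

```r
# Logical mixed with integer: everything becomes integer
typeof(c(TRUE, 1L))    # "integer"

# Integer mixed with double: everything becomes double
typeof(c(1L, 2.5))     # "double"

# Anything mixed with a character: everything becomes character
typeof(c(2.5, "a"))    # "character"

# This is why FALSE turns into 0 and the numbers turn into strings
c(FALSE, 1, 3)         # 0 1 3
c('a', 1, 3)           # "a" "1" "3"
```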
Create the following vector in three different ways in R:
1, 3, 5, 7, 9
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
my_vector <- c('a','p','p','l','e')
my_vector[1]
my_vector[-1]
my_vector[2:4]
my_vector[c(2,4)]
Note: In R, the first element has index 1, unlike Python, where indexing starts from 0. A negative index such as -1 drops that element rather than counting from the end.
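Readers coming from Python may find negative indices surprising, so here is a small sketch. The vector name word is chosen only for illustration.

```r
# Hypothetical example vector
word <- c('a', 'p', 'p', 'l', 'e')

word[1]              # first element: "a"
word[length(word)]   # last element: "e"

# Negative indices remove elements instead of counting from the end
word[-1]             # "p" "p" "l" "e"
word[-c(1, 2)]       # "p" "l" "e"
```

To get the last element, use length() as shown, since word[-1] does not mean "the last one" in R.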
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
x <- 1:10
x[x > 5]
x[x != 5]
x[x <= 5]
x[y > 5]
Note: one can put a logical expression inside the brackets to select the elements of a vector for which the expression returns TRUE. But remember to use the same vector name in the logical expression.
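It may help to look at the logical expression on its own before using it as an index. In this sketch the name keep is an assumption made for illustration; it stores the TRUE/FALSE vector that R uses to pick elements.

```r
x <- 1:10

# The comparison is evaluated element by element
keep <- x > 5
keep         # FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE

# Only the positions marked TRUE are returned
x[keep]      # 6 7 8 9 10

# TRUE counts as 1, so sum() tells us how many elements were selected
sum(keep)    # 5
```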
The following code creates a randomly generated vector. Select all odd numbers in it.
my_vector <- sample(1:100, 25)
We have learned how to create vectors that contain multiple values of the same type with the c() function.
These vectors are called atomic vectors, of which there
are six types: logical, integer, double, character, complex, and raw.
Integer and double vectors are collectively known as numeric vectors. In
this course we won’t use complex and raw vectors so we ignore them
here.
We can use the typeof() function to check the type of a vector. Try the following code in your console.
typeof(c(1,2,3))
typeof(c(1L, 2L, 3L))
typeof(c("a", "b", "c"))
typeof(c(T, F, TRUE, FALSE))
Every vector has two key properties:
Its type, which you can determine with typeof().
typeof(letters)
typeof(1:10)
Its length, which you can determine with length().
x <- seq(1, 20, by = 3)
length(x)
There’s one other related object: NULL. NULL is often used to represent the absence of a vector (as opposed to NA, which is used to represent the absence of a value in a vector). NULL typically behaves like a vector of length 0.
length(NULL)
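The difference between NULL and NA is easiest to see when they are placed inside a vector. A brief sketch:

```r
length(NULL)            # 0: there is no vector at all
length(NA)              # 1: a vector holding one missing value

# NULL disappears when combined; NA keeps its place
length(c(1, NULL, 3))   # 2
length(c(1, NA, 3))     # 3

is.null(NULL)           # TRUE
is.na(c(1, NA, 3))      # FALSE TRUE FALSE
```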
To create a heterogeneous vector that contains values of different atomic types, we have to use lists, which are sometimes called recursive vectors because lists can contain other lists. The diagram below shows the hierarchy of R’s vector types.
To create a “vector” that mixes different data types in R, we create a list using the list() function:
my_list <- list(name = c('James', 'Jane'), grade = c(88, 91))
You can assign a label to each element in the list. We will see how to use these labels on the next page.
Use the mode() function to check the basic data type (numeric, logical, character, list) of a variable.
mode(my_list)
## [1] "list"
Lists are a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You can view the contents of a list together with its structure by simply entering the list name:
my_list
## $name
## [1] "James" "Jane"
##
## $grade
## [1] 88 91
If we only want to see the structure, it is convenient to use the str() function. It is particularly useful for large lists.
str(my_list)
## List of 2
## $ name : chr [1:2] "James" "Jane"
## $ grade: num [1:2] 88 91
Unlike atomic vectors, a list can contain objects of mixed types.
y <- list("a", 1L, 1.5, TRUE)
str(y)
## List of 4
## $ : chr "a"
## $ : int 1
## $ : num 1.5
## $ : logi TRUE
Lists can also contain other lists (which is what makes them powerful), forming a nested list.
z <- list(list(1, 2), list(3, 4))
str(z)
## List of 2
## $ :List of 2
## ..$ : num 1
## ..$ : num 2
## $ :List of 2
## ..$ : num 3
## ..$ : num 4
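We can access an element of a list either by its index, using square brackets, or by its label, using the $ operator. Note that single brackets such as my_list[1] return a list containing the element, while double brackets such as my_list[[1]] return the element itself. In practice we usually use labels. Later we will see how useful lists are for storing the results of a statistical analysis.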
my_list[1]
## $name
## [1] "James" "Jane"
my_list$grade
## [1] 88 91
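Single brackets, double brackets, and $ return different kinds of objects, which is a common source of confusion. The following sketch rebuilds my_list from above and compares them:

```r
my_list <- list(name = c('James', 'Jane'), grade = c(88, 91))

# Single brackets keep the result wrapped in a list
str(my_list[1])        # List of 1

# Double brackets return the element itself
my_list[[1]]           # "James" "Jane"

# A label works inside double brackets too, and matches $
my_list[["grade"]]     # 88 91
identical(my_list[["grade"]], my_list$grade)   # TRUE
```

As a rule of thumb, use [[ ]] or $ when you want to compute with the element, and [ ] when you want a smaller list.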
Make a list of three elements named “Name”, “Gender”, and “Grade” respectively which store information of 4 students. The first element “Name” should be a character vector of four names. The second element “Gender” should be a character vector of four gender codes (such as “F”, or “Female”). The last element should be a numeric vector of four grade scores. Give your list a name.
Subset your list to find the grade of the third student.
R functions are very similar to Python functions. The calling syntax is function_name(positional_args, keyword_args = value, ...). Unlike in Python, keyword (named) arguments in R can come before positional ones.
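To illustrate how arguments are matched, here is a minimal sketch with a small function written just for this purpose. The function power() and its argument names base and exponent are hypothetical and not part of R.

```r
# A hypothetical function with one required and one optional argument
power <- function(base, exponent = 2) {
  base ^ exponent
}

power(3)                        # 9, exponent falls back to its default
power(2, 3)                     # 8, matched by position
power(2, exponent = 3)          # 8, second argument matched by name
power(exponent = 3, base = 2)   # 8, named arguments in any order
power(exponent = 3, 2)          # 8, named argument first, then positional
```

R first matches the named arguments and then fills the remaining ones by position, which is why the last call works.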
R provides many frequently used math functions, such as sqrt(), abs(), exp(), log(), and round().
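The following sketch shows a handful of base R math functions in action. The selection is our own for illustration and is not necessarily the list this page originally showed.

```r
sqrt(16)             # 4, square root
abs(-3.5)            # 3.5, absolute value
exp(0)               # 1, e raised to a power
log(exp(1))          # 1, natural logarithm
log10(1000)          # 3, base-10 logarithm
round(3.14159, 2)    # 3.14, round to 2 decimal places
floor(2.7)           # 2, round down
ceiling(2.2)         # 3, round up
sum(1:10)            # 55, sum of a vector
mean(c(1, 2, 3, 4))  # 2.5, average of a vector
```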
To read the documentation in R, we either use the help() function or simply type ?function_name or ?data_name. Try the following commands in the RStudio Console.
help(mean)
help('mean')
?mean
help('%in%')
?%in%
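Note: For operators, a bare ? does not work (?%in% is a syntax error). Use help() with the operator in quotes, or put the quoted operator after the question mark, as in ?'%in%'.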
If you want to search for a function without knowing its name, use the “Help” tab in the RStudio interface.
Find the factorial function in R using the help documentation.
Find the inverse sine function in R using the help documentation.
In this course, we will mostly handle rectangular data, or tabular data, which are simply data in spreadsheets with rows being samples and columns being features. Here is an example:
my_data <- data.frame(name = c('James', 'Alice', 'Lucy'), Math_grade = c(80, 90, 100), English_grade = c(100, 90, 80))
my_data
## name Math_grade English_grade
## 1 James 80 100
## 2 Alice 90 90
## 3 Lucy 100 80
In this simple example, we have three students and their corresponding scores in two subjects. So each row is a sample (or observation), and each column is an attribute (usually called features or variables) for each sample.
In R, data frames and tibbles (which we will learn about later) are built on top of lists. You can think of a data frame as a list with the following properties:
Each element must be an atomic vector with an assigned name
Each element must be of the same length
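We can check this list-like nature directly. The sketch below rebuilds my_data so that it runs on its own and then inspects it with ordinary list functions.

```r
my_data <- data.frame(
  name = c('James', 'Alice', 'Lucy'),
  Math_grade = c(80, 90, 100),
  English_grade = c(100, 90, 80)
)

is.list(my_data)          # TRUE, a data frame is a list underneath
typeof(my_data)           # "list"
names(my_data)            # the labels of the elements are the column names
length(my_data)           # 3, the number of elements (columns)
sapply(my_data, length)   # every element has the same length, 3
```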
Try the following commands in RStudio console:
View(my_data)
my_data$name
my_data$Math_grade
my_data$English_grade
my_data[1, ]
my_data[2, ]
my_data[3, 3]
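Note: Each column of a data frame is a vector. A single row such as my_data[1, ] is returned as a one-row data frame, because its columns may have different types.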
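A short sketch makes the difference between extracting a column and extracting a row concrete. It rebuilds my_data so that it can be run on its own.

```r
my_data <- data.frame(
  name = c('James', 'Alice', 'Lucy'),
  Math_grade = c(80, 90, 100),
  English_grade = c(100, 90, 80)
)

# A column is a plain vector, so vector functions work on it
is.vector(my_data$Math_grade)   # TRUE
mean(my_data$Math_grade)        # 90

# A row mixes types, so R keeps it as a data frame
is.data.frame(my_data[1, ])     # TRUE

# Row 3, column 3 is a single value: Lucy's English grade
my_data[3, 3]                   # 80
```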
In this course, we are going to analyze many data sets from the real world. As with import in Python, we need to load R packages to access additional data sets and functions.
One of the most popular R package collections for data science is tidyverse. It includes packages such as ggplot2, dplyr, forcats, readr, tidyr, and stringr that do different jobs in data science, including data importing, tidying, transformation, visualization, and others.
For more details, please refer to https://www.tidyverse.org/. To install the package, simply run the following command:
install.packages("tidyverse")
You will not be able to use the functions, objects, and help files in a package until you load it. Once you have installed a package, you can load it with the library() function:
library(tidyverse)
The message printed on loading tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages (recent versions list a few more). These are considered to be the core of the tidyverse because you’ll use them in almost every analysis.
Packages in the tidyverse change fairly frequently. You can see if updates are available, and optionally install them, by running tidyverse_update().
In this course we’ll use other data packages from outside the tidyverse. Let’s install some of them using the following command:
install.packages(c("nycflights13", "gapminder", "Lahman"))
After installing a package, try to load it in the RStudio Console to make sure that it was installed correctly and is available to use.
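One possible way to check several packages at once is sketched below. It uses base R's requireNamespace(), which returns TRUE or FALSE instead of stopping with an error. The variable names pkgs and installed are our own.

```r
# The three data packages installed above
pkgs <- c("nycflights13", "gapminder", "Lahman")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed
```

Any package reported as FALSE needs to be installed again before library() will work for it.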
After we load the tidyverse package, many data sets become accessible. We can access a data set simply by its name. For example, today we will take a look at the data set named mpg.
mpg is a data set of fuel economy from 1999 to 2008 for 38 popular models of cars. More details about the data set can be found at https://ggplot2.tidyverse.org/reference/mpg.html.
We can also open the same help document from within R with the following command.
?mpg
Question: How many samples do we have in the data set? How many variables?
mpg Data set
Now let’s have a complete view of the data set by using the command:
View(mpg)
Or you may use the following commands as well:
mpg
glimpse(mpg)
The good thing about glimpse(mpg) is that it lists all columns. If you only type mpg and there are too many columns, some of them will be suppressed.
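The same kind of inspection works on any data frame, even without the tidyverse. As a sketch, we use mtcars, a data set that ships with base R; it is chosen here only so that the example runs without extra packages and is not part of this course's material.

```r
# mtcars is available in every R session without loading a package
head(mtcars, 3)   # the first three rows
dim(mtcars)       # 32 11, number of rows and number of columns
names(mtcars)     # the column names
str(mtcars)       # compact overview of every column, similar to glimpse()
```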
Next, we won’t learn any new commands. Instead, I will show you what we will learn to do in this course, using mpg as an example.
Please don’t worry about how the following figures are generated; we will learn that starting next class.
Data exploration refers to the practice of finding useful patterns or trends in a given data set, and plotting graphs to visualize data is one of the most important ways to explore data.
For example, cty is the miles per gallon in the city, and hwy is the miles per gallon on the highway. What does the following plot indicate? Why does it look like an increasing line?
Now let’s explore the effect of drive train type (f = front-wheel drive, r = rear-wheel drive, 4 = 4-wheel drive) on fuel efficiency. What conclusion can you draw from the following graph? Which type is the most fuel-efficient?
Now let’s explore the effect of number of cylinders on fuel efficiency. What conclusion can you draw from the following graph?
In this class, we learned R basics to prepare ourselves to visualize and explore data.
Usually, visualizing data is one of the best ways to begin exploring hidden insights in data.
To visualize data, sometimes we need to import, clean, and transform data first. This process is called data wrangling, which means “fighting with data”.
We will learn how to visualize data starting next class.