This course covers how to import, clean, transform, and visualize data, and how to communicate the results, using programming tools in R or Python.
We will also learn how to explore data to gain useful insights.
Download the latest version of R from the Comprehensive R Archive Network, or CRAN (link: https://cran.r-project.org).
RStudio is an integrated development environment (IDE) for R programming. You can think of it as a user-friendly graphical interface for using R. You can download and install it from https://posit.co/download/rstudio-desktop/.
If you are using macOS, make sure that you download the R and RStudio versions that are compatible with your macOS version.
To understand the objectives of this course better, we will study an example. However, we need to warm up with some R basics first.
R is a programming language for statistical computing and graphics. Unlike Python or C, it is not a general-purpose language. It is designed to do statistical analysis.
Therefore, R is really designed for anyone to use, even without any programming experience. This makes some features of R very different from other programming languages.
Once you know how to use R, you will find how powerful it is: you can do complicated graphing or analysis jobs with very little code.
R and Python are the two most popular programming languages in Data Science. Although not as popular as Python, R was still among the top 10 most popular programming languages in 2022.
R provides the same basic arithmetic operations as Python. Below is a brief list:
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
2 + 3
2 - 3
2 * 3
2 / 3
2 ^ 3
2 %% 3
2 %/% 3
Note that the last three operations are written differently in Python, which uses **, %, and // for power, remainder, and integer division.
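For readers coming from Python, the following sketch pairs each of the last three operators with its Python counterpart. The numbers 7 and 3 are chosen only for illustration.

```r
# Exponentiation: R uses ^ (Python uses **)
power <- 7 ^ 3          # 343

# Remainder after division: R uses %% (Python uses %)
remainder <- 7 %% 3     # 1

# Integer division: R uses %/% (Python uses //)
quotient <- 7 %/% 3     # 2

# Quotient and remainder together reconstruct the original number
quotient * 3 + remainder == 7   # TRUE
```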
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
2 == 3
2 != 3
2 > 3
2 < 3
2 >= 3
2 <= 3
Note: These operators are exactly the same as in Python - the result is a logical value.
The two logical values in R are TRUE and FALSE.
Like Python, R is case sensitive. Therefore TRUE is not the same as True. Try the following operations in the RStudio Console:
TRUE
True
FALSE
False
T
F
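Note: In R, T is shorthand for TRUE, which represents being logically true, and F is shorthand for FALSE, which represents being logically false. Unlike TRUE and FALSE, T and F are ordinary variables that can be reassigned, so spelling out TRUE and FALSE is safer.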
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
!TRUE
!FALSE
TRUE & TRUE
TRUE & FALSE
TRUE | FALSE
FALSE | FALSE
Note: we will discuss the difference between & and &&, and between | and ||, later.
In R, = also works as an assignment operator. However, it is not recommended. In the R community, people generally prefer <- or -> as the assignment operator. Try the following operations line by line:
a <- 3
a
4 -> a
a
Note: the arrow indicates the direction of assignment. <- assigns the value on its right to the name on its left, and -> assigns the value on its left to the name on its right.
Vectors in R are like lists or tuples in Python, with one key difference: all elements of an R vector must be of the same data type (numbers, characters, logicals, etc.).
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
c(1,3,5)
c('a', 1, 3)
c(FALSE, 1, 3)
1:10
seq(0, 10, by = 2) # the "by" option controls the step
seq(0, 10, length.out = 4) # the "length.out" option controls the total number of elements
Note: when a vector is created from values of several basic data types, R automatically converts all elements to a single data type.
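The conversion follows a fixed order: logical values become integers, integers become doubles, and anything becomes character if a character is present. A minimal sketch, using made-up values and the typeof() function (which reports the type of a vector):

```r
# A character value forces every element to character
mixed_chr <- c("x", 2, TRUE)
mixed_chr              # "x" "2" "TRUE"
typeof(mixed_chr)      # "character"

# A logical mixed with an integer becomes an integer (TRUE -> 1)
mixed_int <- c(TRUE, 2L)
typeof(mixed_int)      # "integer"

# A logical mixed with a double becomes a double
mixed_dbl <- c(TRUE, 2.5)
typeof(mixed_dbl)      # "double"
```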
Create the following vector in three different ways in R:
1, 3, 5, 7, 9
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
my_vector <- c('a','p','p','l','e')
my_vector[1]
my_vector[-1]
my_vector[2:4]
my_vector[c(2,4)]
Note: In R, the first element has the index one. This is different from Python (which starts from zero).
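A second difference from Python is worth a quick sketch: a negative index in R drops that position instead of counting from the end, and a range such as 2:4 includes both ends. The vector v below is made up for illustration.

```r
v <- c(10, 20, 30, 40, 50)

v[1]          # 10: indexing starts at 1
v[-1]         # 20 30 40 50: everything except the first element
v[2:4]        # 20 30 40: both ends of the range are included
v[c(2, 4)]    # 20 40: pick arbitrary positions
v[length(v)]  # 50: the last element
```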
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
x <- 1:10
x[x > 5]
x[x != 5]
x[x <= 5]
x[y > 5]
Note: one can put a logical expression inside the brackets to select the elements of a vector for which the expression returns TRUE. But remember to use the same vector name in the logical expression.
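It can help to look at the logical expression on its own before using it as an index. The sketch below uses a made-up vector named scores to show the intermediate TRUE/FALSE vector.

```r
scores <- c(3, 8, 1, 9)

# The comparison is applied to every element and returns a logical vector
is_big <- scores > 5
is_big               # FALSE TRUE FALSE TRUE

# Indexing with that logical vector keeps only the TRUE positions
scores[is_big]       # 8 9
scores[scores > 5]   # same result, written in one step

# TRUE counts as 1, so sum() counts how many elements match
sum(scores > 5)      # 2
```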
Basic algebraic operators in R: +, -, *, /, ^, %%, %/%
Basic assignment operators in R: ->, <-
Basic atomic data types in R: integer, double, character and logical
Basic logical values and operators: TRUE, FALSE, ==, !=, >=, <=, >, <, &&, ||, &, |
Create atomic vectors in R: c(), seq()
Help documentation: use ? or help()
Use functions in R: positional arguments and keyword arguments.
We have learned how to create vectors that contain multiple values
of the same type with the c() function.
These vectors are called atomic vectors, of which there
are six types: logical, integer, double, character, complex, and raw.
Integer and double vectors are collectively known as numeric vectors. In
this course we won’t use complex and raw vectors so we ignore them
here.
We can use the typeof() function to check the type of a
vector. Try the following code in your console.
typeof(c(1,2,3))
typeof(c(1L, 2L, 3L))
typeof(c("a", "b", "c"))
typeof(c(T, F, TRUE, FALSE))
Every vector has two key properties:
Its type, which you can determine with typeof().
typeof(letters)
typeof(1:10)
Its length, which you can determine with length().
x <- seq(1, 20, by = 3)
length(x)
There’s one other related object: NULL.
NULL is often used to represent the absence of a vector (as
opposed to NA, which is used to represent the absence of a
value in a vector). NULL typically behaves like a vector of
length 0.
length(NULL)
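The difference between NULL and NA shows up clearly when each is placed inside c(). The following is a small sketch with made-up values: NULL disappears, while NA occupies a position.

```r
length(NULL)                 # 0: no vector at all
length(NA)                   # 1: a vector holding one missing value

with_null <- c(1, NULL, 3)   # NULL contributes nothing
with_na   <- c(1, NA, 3)     # NA keeps its slot

length(with_null)            # 2
length(with_na)              # 3
is.na(with_na)               # FALSE TRUE FALSE
```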
In a vector, each element can be assigned a label, for example:
grades <- c(James = 79, Jane = 85, Jack = 91)
grades
## James  Jane  Jack
##    79    85    91
Here the values in grades are the numbers 79, 85, 91, but now each value is labeled with a student
name. The labels of a vector form another character vector. We can
retrieve them with the function names():
names(grades)
## [1] "James" "Jane" "Jack"
Its type is character.
student_names <- names(grades)
typeof(student_names)
## [1] "character"
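Labels are useful for more than display: they can also be used as indices. The sketch below reuses the grades vector and adds a hypothetical vector temps to show that labels can be attached after a vector is created.

```r
grades <- c(James = 79, Jane = 85, Jack = 91)

# Select elements by label instead of by position
grades["Jane"]                  # a named vector containing 85
grades[c("Jack", "James")]      # labels are returned in the requested order

# unname() drops the label and keeps only the value
unname(grades["Jane"])          # 85

# Labels can also be assigned to an existing vector
temps <- c(21.5, 19.0)
names(temps) <- c("Mon", "Tue")
names(temps)                    # "Mon" "Tue"
```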
To use a function, we need to understand the following basic concepts:
Argument: inputs of a function
Required Argument: If a required argument is missing, the function will raise an error.
Optional Argument: One doesn’t have to supply an optional argument; if it is missing, a default value is used.
Positional Argument: An argument that is matched by its position, so it has to be supplied in a specific order.
Keyword Argument: An argument that is specified by its name (usually optional, with a default value).
my_data <- c(1, 2, 2, 5, 10, NA)
mean(my_data, trim = 0, na.rm = TRUE)
## [1] 4
In the example above, we call the function mean() to
compute the average of the data. Let’s check the help documentation of the
function mean():
help("mean")
So we see that there are a few arguments for the function:
x keyword - the input data: This is the first argument; this argument is required and there is no default value.
trim keyword - the fraction of observations to be trimmed from each end of x before the mean is computed: This is the second argument, with a default value of 0.
na.rm keyword - a logical value, TRUE or FALSE, indicating whether NA values are removed before the mean is computed: This is the third argument, with a default value of FALSE.
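The same distinction between positional and keyword arguments applies to any R function. As a sketch, the calls to seq() below produce the same vector whether the arguments are matched by position or by name. The function rounded_mean() is a hypothetical helper written only to show a required argument next to an optional one with a default value.

```r
# Matched by position: from, to, by
by_position <- seq(0, 10, 2)

# Matched by keyword: the order no longer matters
by_keyword <- seq(from = 0, to = 10, by = 2)
reordered  <- seq(by = 2, to = 10, from = 0)

identical(by_position, by_keyword)   # TRUE
identical(by_keyword, reordered)     # TRUE

# Hypothetical helper: x is required, digits is optional (default 1)
rounded_mean <- function(x, digits = 1) {
  round(mean(x), digits)
}

rounded_mean(c(1, 2, 4))               # 2.3, using the default digits = 1
rounded_mean(c(1, 2, 4), digits = 0)   # 2
```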
Try the following code and see what it gives to you. Can you explain why the result is what you observed?
mean(trim = 0, na.rm = T)
mean(my_data)
mean(my_data, trim = 0.2)
mean(my_data, trim = 0.2, na.rum = T)
To create a “vector” that mixes different data types in R, we need to
create a list using the list() function:
my_list <- list(student_name = c('James', 'Jane', 'Jack'), student_grade = c(79, 85, 91))
Here the first element in the list is a character vector, with the
label student_name. The second element in the list is a
double vector, with the label student_grade. To retrieve an
element in a list, we usually use the $ along with the
label.
my_list$student_name
## [1] "James" "Jane" "Jack"
my_list$student_grade
## [1] 79 85 91
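Besides $, base R offers double brackets [[ ]] and single brackets [ ] for lists. The sketch below rebuilds the same my_list and shows the difference: [[ ]] returns the element itself, while [ ] returns a smaller list.

```r
my_list <- list(student_name = c("James", "Jane", "Jack"),
                student_grade = c(79, 85, 91))

length(my_list)                   # 2: the list has two elements
names(my_list)                    # "student_name" "student_grade"

# Each element keeps its own type
typeof(my_list$student_name)      # "character"
typeof(my_list$student_grade)     # "double"

# [[ ]] extracts the element itself, by label or by position
my_list[["student_grade"]]        # 79 85 91
my_list[[2]]                      # 79 85 91

# [ ] returns a list that contains the selected element
typeof(my_list["student_grade"])  # "list"
```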
In this course, we will mostly handle rectangular data, or tabular data, which are simply data in spreadsheets with rows being samples and columns being features. Below is an example:
my_data <- data.frame(name = c('James', 'Alice', 'Lucy'), Math_grade = c(80, 90, 100), English_grade = c(100, 90, 80))
my_data
##    name Math_grade English_grade
## 1 James         80           100
## 2 Alice         90            90
## 3  Lucy        100            80
In this simple example, we have three students and their corresponding scores in two subjects. So each row is a sample (or observation), and each column is an attribute (usually called features or variables) for each sample.
In R, data frames and tibbles (we will
learn about these later) are built on top of lists. You can understand data
frames as lists with the following properties:
Each element must be an atomic vector with an assigned label.
Each element must be of the same length.
Again, we have the two basic functions length() and
names() to retrieve this information for a data frame.
length(my_data)
## [1] 3
names(my_data)
## [1] "name" "Math_grade" "English_grade"
Note that here the length() function returns the
number of columns (features). To get the number of rows, we
need the function nrow():
nrow(my_data)
## [1] 3
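The list-like behavior of a data frame can be checked directly. The following sketch uses a small hypothetical data frame named scores (four rows, two columns) so that the row and column counts differ.

```r
scores <- data.frame(id = c(1L, 2L, 3L, 4L),
                     score = c(7.5, 8.0, 6.5, 9.0))

is.list(scores)       # TRUE: a data frame is a list of columns
length(scores)        # 2: number of columns
ncol(scores)          # 2: same information, with a clearer name
nrow(scores)          # 4: number of rows
dim(scores)           # 4 2: rows first, then columns

# A column is an ordinary atomic vector, so vector functions apply
mean(scores$score)    # 7.75
```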
Try the following commands in the RStudio console. Before you execute the code, think about what you expect the result to be:
View(my_data)
my_data$name
my_data$Math_grade
my_data$English_grade
my_data[1, ]
my_data[2, ]
my_data[3, 3]
In this course, we are going to analyze many data sets from the real
world. As with import in Python, we also need to load other R
packages to access data sets and functions.
One of the most popular R package collections for Data Science is
tidyverse. It includes packages such as ggplot2, dplyr, forcats,
readr, tidyr, and stringr that do
different jobs in Data Science, including data importing, tidying,
transformation, visualization and others.
For more details, please refer to https://www.tidyverse.org/. To install the package, simply run the following command:
install.packages("tidyverse")
You will not be able to use the functions, objects, and help files in
a package until you load it with library(). Once you have
installed a package, you can load it with the library()
function:
library(tidyverse)
The message printed when the package loads tells you that tidyverse is loading the
ggplot2, tibble, tidyr, readr, purrr, and dplyr packages. These are
considered to be the core of the tidyverse because
you’ll use them in almost every analysis.
Packages in the tidyverse change fairly frequently. You can see if
updates are available, and optionally install them, by running
tidyverse_update().
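Installation only needs to happen once per machine, while library() is needed in every session. One common pattern, sketched here under the assumption that you want a script to install a package only when it is missing, uses the base R function requireNamespace(). The helper name is_installed() is hypothetical.

```r
# Hypothetical helper: TRUE when the package is available, without attaching it
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")                  # TRUE: stats ships with every R installation
is_installed("notARealPackage12345")   # FALSE: no such package

# Guarded install (commented out so that nothing is downloaded here):
# if (!is_installed("tidyverse")) install.packages("tidyverse")
```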
After we load the package tidyverse, many data sets
become accessible. We can access a data set simply by its name. For
example, today we will take a look at the data set named mpg.
mpg is a data set of fuel economy from 1999 to 2008 for
38 popular models of cars. More details about the data set can be found
at https://ggplot2.tidyverse.org/reference/mpg.html.
We can also open the help documentation directly in R with the following command.
?mpg
mpg Data set
Now let’s have a complete view of the data set by using the command:
View(mpg)
Or you may use the following commands as well:
mpg
glimpse(mpg)
The good thing about glimpse(mpg) is that it will list
all columns. If you only use mpg, when there are too many
columns, some of them will be suppressed.
mpg
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
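Base R has its own tools for a first look at a data frame, and they work without loading any package. The sketch below uses mtcars, a different data set that ships with base R, purely as a stand-in; the same functions can be applied to any data frame.

```r
# mtcars is a built-in base R data set, used here only for illustration
dim(mtcars)        # 32 11: number of rows, then number of columns
names(mtcars)      # the column (variable) names

# str() gives a column-by-column overview, similar in spirit to glimpse()
str(mtcars)

# head() shows the first few rows
head(mtcars, 3)
```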
Lab Questions:
How many columns (features/variables) are there? How many rows (samples) are there?
What is the meaning of each variable?
What is the average mpg in the city for all car models
in the data set?
Next, we won’t learn any new commands. Instead, I will show you
what we will learn to do in this course, using mpg as an
example.
Please don’t worry about how the following figures are generated; we will learn that starting in the next class.
Data exploration refers to the practice of finding useful patterns or trends in a given data set, and plotting graphs to visualize data is one of the most important ways to explore data.
For example, given that cty is the miles per gallon in the
city and hwy is the miles per gallon on the highway, what does
the following plot indicate? Why does it look like an increasing
line?
Now let’s explore the effect of drive train type (f = front-wheel drive, r = rear-wheel drive, 4 = 4-wheel drive) on fuel efficiency. What conclusion can you draw from the following graph? Which type is the most fuel efficient?
Now let’s explore the effect of number of cylinders on fuel efficiency. What conclusion can you draw from the following graph?
When we use graphs to summarize and visualize data, we are doing descriptive statistics. To understand what it is, we need to go through some basic concepts of statistics.
Let’s start our study by looking at a kind of statement that is commonly seen in news media:
A poll of 1,247 voters shows that 48% of US voters support the Republican candidate, with a 3% margin of error.
Clearly, this is a statement based on statistics. Below are the key points for understanding this statement:
A “poll” means that the data were collected from a small sample (1,247) of US voters, not all US voters.
“48% of US voters support the republican candidate” is an estimate for all US voters, based on the answers of that small group.
“3% margin of error” measures how accurate the estimate (48%) is likely to be.
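To see where a figure like 3% can come from, here is a rough sketch of the standard margin-of-error formula for a proportion. It assumes a simple random sample and a 95% confidence level; neither assumption is stated in the poll statement above, so treat the result as an illustration rather than a reconstruction of the pollster's method.

```r
n     <- 1247   # number of voters polled
p_hat <- 0.48   # share of the sample supporting the candidate

# Assumption: 95% confidence level, so the multiplier is about 1.96
z <- qnorm(0.975)

# Standard error of a sample proportion
standard_error <- sqrt(p_hat * (1 - p_hat) / n)

# Margin of error = multiplier * standard error
margin_of_error <- z * standard_error
round(margin_of_error, 3)   # 0.028, which is roughly 3%
```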
In statistics, usually we study a specific collection of objects (people, companies, cars etc.). In this example, this collection of objects is all US voters. It is called a population.
Usually we are interested in a specific attribute or characteristic of the population. In this example, we are interested in whether a voter supports a candidate. This is called a random variable. It is data of this variable that are collected.
Usually it is too costly or infeasible to study the entire population. Therefore we only collect data for a subset of the population. In this example, it is the 1,247 voters who answered the question. It is called a sample.
Statement: “A study of survival of 1,225 newly diagnosed breast cancer cases finds that the average seven-year survival rate for Stage I breast cancer was 92%”.
seven-year survival rate: the percentage of patients that survive seven years after diagnosis.
What is the population of this study?
What is the sample of this study?
What is the random variable of this study?
A study shows that 71.6% of US adults are overweight. Answer the following questions:
Under what condition is the study descriptive?
Under what condition is the study inferential?
Which one is more likely to be the case?
Categorical (or qualitative) variable: takes values that are not numerical (not numbers)
Numeric (or quantitative) variable: takes values that are numeric (numbers)
Discrete variable: A numeric variable whose possible values can be listed.
Continuous variable: A numeric variable whose possible values form an interval of real numbers.
Give a real example of different variable types:
categorical but not ordinal
ordinal
discrete numeric
continuous numeric
Before we start to plot graphs, we need to review the basic types of data plots. There are many of them, and below are a few examples, including some of the most commonly used ones:
Why do we have this many plot types? One reason is that we need different plots to best illustrate the relationship between (usually one or two) variables of different types.
Bar plots: (usually) for one categorical variable
Histograms: for one numeric variable
Box plots: for one continuous variable
Scatter plots: (usually) for two numeric variables
Multiple box plots: for one continuous variable and one categorical/discrete variable
Stacked bar plots: for two categorical variables
Next, we will use the mpg data set to give an example of each
plot type and explain its meaning.
A bar plot shows the distribution of one categorical variable.
ggplot(mpg) +
geom_bar(aes(x = drv))
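The height of each bar is simply the number of observations in that category. The same counts can be computed in base R with table(); the sketch below uses a short made-up vector drive_types in place of the real column.

```r
# Hypothetical drive train values for six cars
drive_types <- c("f", "4", "f", "r", "4", "f")

# table() counts how many times each category appears
counts <- table(drive_types)
counts

counts[["f"]]   # 3 cars with front-wheel drive
counts[["4"]]   # 2 cars with 4-wheel drive
counts[["r"]]   # 1 car with rear-wheel drive
```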
A histogram shows the distribution of one numeric variable (discrete or continuous).
ggplot(mpg) +
geom_histogram(aes(x = hwy), binwidth = 5)
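The binwidth option sets how wide each interval (bin) is; the bar height is the number of values that fall in the bin. The idea can be sketched in base R with cut() and table() on a made-up vector. Note that ggplot2 chooses its bin boundaries differently by default, so this illustrates the counting only, not the exact bins in the plot.

```r
# Hypothetical values to be grouped into bins of width 5
values <- c(12, 14, 17, 21, 23, 24, 29)

# cut() assigns each value to an interval: (10,15], (15,20], (20,25], (25,30]
bins <- cut(values, breaks = seq(10, 30, by = 5))

# Counting the values per interval gives the histogram bar heights
bin_counts <- table(bins)
bin_counts   # 2 1 3 1
```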
A boxplot shows a five-number summary of a numeric variable (discrete or continuous).
ggplot(mpg) +
geom_boxplot(aes(x = hwy))
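The five numbers are the minimum, the first quartile, the median, the third quartile, and the maximum. In base R they can be computed with quantile(); the sketch below uses the made-up values 1 to 9 so the result is easy to check by hand.

```r
values <- 1:9

# quantile() returns the 0%, 25%, 50%, 75% and 100% points by default
five_numbers <- quantile(values)
five_numbers    # 1 3 5 7 9

# The box in a box plot spans the interquartile range (Q3 - Q1)
IQR(values)     # 4
```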
A scatter plot usually shows the relationship between two numeric variables.
ggplot(mpg) +
geom_point(aes(x = hwy, y = cty))
A multiple box plot usually shows the relationship between one categorical variable and one numeric variable.
ggplot(mpg) +
geom_boxplot(aes(x = drv, y = cty))
A stacked bar plot usually shows the relationship between two categorical variables.
ggplot(data = mpg) +
geom_bar(mapping = aes(x = drv, fill = class))
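Each colored segment in a stacked bar corresponds to one cell of a two-way count table. As a sketch, the base R function table() builds such a table from two categorical vectors; the vectors drive and size below are made up and much shorter than the real columns.

```r
# Hypothetical drive train and class values for six cars
drive <- c("f", "f", "4", "r", "4", "f")
size  <- c("compact", "midsize", "suv", "suv", "suv", "compact")

# Two-way table: rows are drive types, columns are classes
cross <- table(drive, size)
cross

cross["f", "compact"]   # 2 front-wheel-drive compact cars
cross["4", "suv"]       # 2 four-wheel-drive SUVs
sum(cross)              # 6: every car is counted exactly once
```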
Finish all lab exercises. Submit your R code.
Compute \[ \left( \frac{2206 \sqrt{2}}{99^2} \right)^{-1} \]. Submit your answer as a number rounded to three decimal places.
What is the result of the following code (don’t execute the code; work it out by inspection):
x <- 1
2 <- y
(x == y) | (x < y)
Create a vector containing 1, 3, 5, 7, 9 in three different ways. Give your code.
Using the help documentation, find the factorial function and the inverse sine function in R. Submit the function names and give an example of each.
Create a data frame with the first column being years from 1999 to 2002, and the second column being the day of the week (Monday, Tuesday, etc.) for January 1st of that year. Submit your code and show the result.