This course covers how to import, clean, transform, and visualize data, and how to communicate the results, using R or Python.
We will also learn how to explore data to gain useful insights.
Download the latest version of R from the Comprehensive R Archive Network, or CRAN (link: https://cran.r-project.org).
RStudio is an integrated development environment (IDE) for R programming. You can think of it as a user-friendly graphical interface for using R. You can download and install it from http://www.rstudio.com/download.
If you are using macOS, make sure that you download the R/RStudio version that is compatible with your macOS version.
To better understand the objectives of this course, we will study an example. First, however, we need to warm up with some R basics.
R is a programming language for statistical computing and graphics. Unlike Python or C, it is not a general-purpose language. It is designed to do statistical analysis.
Therefore, R is really designed for anyone to use, even without any programming experience. This makes some features of R very different from other programming languages.
Once you know how to use R, you will find how powerful it is: you can do complicated graphing or analysis jobs with very short code.
R and Python are the two most popular programming languages in Data Science. Although not as popular as Python, R is still among the top 10 popular programming languages in 2022.
R has the same kinds of operators as Python. Below is a brief list:
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
2 + 3
2 - 3
2 * 3
2 / 3
2 ^ 3
2 %% 3
2 %/% 3
Note that the last three operators are different from Python.
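For readers coming from Python, the three operators can be lined up against their Python spellings. This is a small sketch; the operands 7 and -7 are chosen only for illustration.

```r
# Exponentiation: R uses ^ (Python uses **)
2 ^ 3        # 8

# Remainder (modulo): R uses %% (Python uses %)
7 %% 3       # 1

# Integer division: R uses %/% (Python uses //)
7 %/% 3      # 2

# As in Python, the remainder takes the sign of the divisor
(-7) %% 3    # 2
(-7) %/% 3   # -3
```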
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
2 == 3
2 != 3
2 > 3
2 < 3
2 >= 3
2 <= 3
Note: These operators are exactly the same as in Python - the result is a logical value.
The two logical values in R are TRUE and FALSE.
As in Python, R is case sensitive. Therefore TRUE is not the same as True. Try the following operations in the RStudio Console:
TRUE
True
FALSE
False
T
F
Note: In R, TRUE is the same as T, which represents being logically true. FALSE is the same as F, which represents being logically false.
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
!TRUE
!FALSE
TRUE & TRUE
TRUE & FALSE
TRUE | FALSE
FALSE | FALSE
Note: we will discuss the difference between & and &&, and between | and ||, later.
In R, = still works as an assignment operator. However, it is not recommended. The R community generally prefers <- or -> as the assignment operator. Try the following operations, one line at a time:
a <- 3
a
4 -> a
a
Note: the arrow indicates the direction of assignment: <- assigns the value on the right to the name on the left, and -> assigns the value on the left to the name on the right.
Vectors in R are like lists or tuples in Python, with one key difference: all elements of an R vector must be of the same data type (numbers, characters, logicals, etc.).
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
c(1,3,5)
c('a', 1, 3)
c(FALSE, 1, 3)
1:10
seq(0, 10, by = 2) # the "by" option controls the step
seq(0, 10, length.out = 4) # the "length.out" option controls the total number of elements
Note: when a vector is created from elements of several basic data types, R automatically converts all elements to a single data type.
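The conversion can be made visible by checking the type of each vector. The sketch below uses mode(), which these notes introduce further down, on the same vectors as above.

```r
# mode() reports the basic data type that the whole vector ended up with
mode(c(1, 3, 5))        # "numeric"
mode(c('a', 1, 3))      # "character": the numbers become "1" and "3"
mode(c(FALSE, 1, 3))    # "numeric": FALSE becomes 0
mode(c(TRUE, FALSE))    # "logical"

# Printing the vectors shows the converted elements
c('a', 1, 3)            # "a" "1" "3"
c(FALSE, 1, 3)          # 0 1 3
```

In these examples a character element turns everything into characters, and a number mixed with logicals turns everything into numbers.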
Create the following vector in three different ways in R:
1, 3, 5, 7, 9
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
my_vector <- c('a','p','p','l','e')
my_vector[1]
my_vector[-1]
my_vector[2:4]
my_vector[c(2,4)]
Note: In R, the first element has the index one. This is different from Python (which starts from zero).
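Two more differences from Python are worth seeing side by side: a negative index drops an element instead of counting from the end, and a range such as 2:4 includes both ends. A short sketch using the same vector:

```r
my_vector <- c('a', 'p', 'p', 'l', 'e')

my_vector[1]                   # "a": indexing starts at 1
my_vector[-1]                  # "p" "p" "l" "e": everything except the first element
my_vector[2:4]                 # "p" "p" "l": both ends of the range are included
my_vector[c(2, 4)]             # "p" "l": elements 2 and 4 only
my_vector[length(my_vector)]   # "e": the last element
```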
Try the following operations in the Console of RStudio and press Enter to see the result for each line:
x <- 1:10
x[x > 5]
x[x != 5]
x[x <= 5]
x[y > 5]
Note: you can put a logical expression inside the brackets to select the elements of a vector for which the expression is TRUE. Remember to use the same vector name inside the logical expression.
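It may help to see that the expression inside the brackets is itself a vector of logical values, one per element. The sketch below stores it in a variable named keep (a name chosen here for illustration) and also combines two conditions with & and |.

```r
x <- 1:10

keep <- x > 5   # a logical vector, one TRUE/FALSE per element of x
keep            # FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
x[keep]         # 6 7 8 9 10, the same as x[x > 5]

# Conditions can be combined element by element with & and |
x[x > 3 & x < 8]    # 4 5 6 7
x[x < 3 | x > 8]    # 1 2 9 10
```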
To create a “vector” mixed with different data types in R, we need to create a list using the list() function:
my_list <- list(name = c('James', 'Jane'), grade = c(88, 91))
You can assign a label to each element in the list; we will see how to use these labels shortly.
Use the mode() function to check the basic data type (numeric, logical, character, list) of a variable.
mode(my_list)
## [1] "list"
We can access an element of a list either by its index, using [ ], or by its label, using the operator $. In practice we usually use labels. Later we will see how lists are useful for storing the results of a statistical analysis.
my_list[1]
## $name
## [1] "James" "Jane"
my_list$grade
## [1] 88 91
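Note that the output of my_list[1] above still carries the $name label: single brackets return a smaller list rather than the vector inside it. The sketch below contrasts this with double brackets [[ ]], which are not used elsewhere in these notes but are standard R.

```r
my_list <- list(name = c('James', 'Jane'), grade = c(88, 91))

mode(my_list[1])       # "list": single brackets return a sub-list
mode(my_list[[1]])     # "character": double brackets return the element itself
my_list[['grade']]     # 88 91, the same as my_list$grade
my_list$grade[2]       # 91: index into the extracted vector
mean(my_list$grade)    # 89.5
```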
R functions are very similar to Python functions. The syntax is function_name(positional_args, keyword_args = value, ...). (In R, keyword arguments can go first.)
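A minimal sketch of the calling convention, using seq() from earlier and the base function paste() (not otherwise covered in these notes):

```r
# Positional arguments are matched by order: from, to, by
seq(0, 10, 2)                      # 0 2 4 6 8 10

# Named arguments can be given in any order
seq(by = 2, from = 0, to = 10)     # 0 2 4 6 8 10

# A named argument may even come before the positional ones
paste(sep = '-', 'a', 'b')         # "a-b"
paste('a', 'b', sep = '-')         # "a-b"
```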
Below are some frequently used math functions.
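As an illustration, here are a few base R math functions applied to small inputs. This selection is an assumption made for this sketch and is not necessarily the list intended in the original notes.

```r
sqrt(16)             # 4
abs(-3)              # 3
exp(0)               # 1
log(1)               # 0 (natural logarithm)
log10(100)           # 2
round(3.14159, 2)    # 3.14
sum(c(1, 3, 5))      # 9
mean(c(1, 3, 5))     # 3
max(c(1, 3, 5))      # 5
min(c(1, 3, 5))      # 1
```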
To look up documentation in R, we either use the help() function or simply type ?function_name or ?data_name. Try the following commands in the RStudio console.
help(mean)
help('mean')
?mean
help('%in%')
?%in%
Note: For operators, you cannot use ? directly. Either use help(), or put the operator in quotes after ?, as in ?'%in%'.
In this course, we will mostly handle rectangular data, or tabular data, which are simply data in spreadsheets with rows being samples and columns being features. Below is an example:
my_data <- data.frame(name = c('James', 'Alice', 'Lucy'), Math_grade = c(80, 90, 100), English_grade = c(100, 90, 80))
my_data
##    name Math_grade English_grade
## 1 James         80           100
## 2 Alice         90            90
## 3  Lucy        100            80
In this simple example, we have three students and their corresponding scores in two subjects. So each row is a sample (or observation), and each column is an attribute of the samples (usually called a feature or variable).
Try the following commands in RStudio console:
View(my_data)
my_data$name
my_data$Math_grade
my_data$English_grade
my_data[1, ]
my_data[2, ]
my_data[3, 3]
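Note: Each column of a data frame is a vector. A row selected with my_data[1, ] is itself a one-row data frame, because its entries can be of different types.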
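Because a column is a vector, everything learned about vectors above carries over to data frames. A short sketch on the same my_data; nrow() and ncol() are base R functions not introduced in the text.

```r
my_data <- data.frame(
  name = c('James', 'Alice', 'Lucy'),
  Math_grade = c(80, 90, 100),
  English_grade = c(100, 90, 80)
)

nrow(my_data)                  # 3 rows (samples)
ncol(my_data)                  # 3 columns (variables)
mean(my_data$Math_grade)       # 90: vector functions work on a column
my_data$name[2]                # "Alice": index into a column like any vector
my_data[3, 3]                  # 80: row 3, column 3

# A logical expression in the row position selects matching rows
my_data[my_data$Math_grade >= 90, ]   # the rows for Alice and Lucy
```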
In this course, we are going to analyze many data sets from the real world. As with import in Python, we also need to load other R packages to access data sets and functions.
One of the most popular R package collections for data science is tidyverse. It includes packages such as ggplot2, dplyr, forcats, readr, tidyr, and stringr, which do different jobs in data science, including data importing, tidying, transformation, visualization, and others.
For more details, please refer to https://www.tidyverse.org/. To install the package, simply run the following command:
install.packages("tidyverse")
You will not be able to use the functions, objects, and help files in a package until you load it with library(). Once you have installed a package, you can load it with the library() function:
library(tidyverse)
The message printed by library(tidyverse) tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages. These are considered to be the core of the tidyverse because you’ll use them in almost every analysis.
Packages in the tidyverse change fairly frequently. You can see if updates are available, and optionally install them, by running tidyverse_update().
In this course we’ll also use other data packages from outside the tidyverse. Let’s install them all together using the following command:
install.packages(c("nycflights13", "gapminder", "Lahman"))
After installing any package, try to load it in the RStudio Console to make sure that you have installed it correctly and it is available to use.
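One way to check several packages at once, without loading each by hand, is the base function requireNamespace(). This is only a sketch of one possible check; the variable names pkgs and installed are chosen here for illustration.

```r
pkgs <- c("tidyverse", "nycflights13", "gapminder", "Lahman")

# requireNamespace() returns TRUE if the package can be loaded and FALSE
# otherwise, without attaching it to the search path
installed <- sapply(pkgs, requireNamespace, quietly = TRUE)
installed

# Names of any packages that still need install.packages()
pkgs[!installed]
```

If the last line prints character(0), all four packages were found.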
After we load the tidyverse package, many data sets become accessible, and we can access each one simply by its name. For example, today we will take a look at the data set named mpg.
mpg is a data set of fuel economy from 1999 to 2008 for 38 popular models of cars. More details about the data set can be found at https://ggplot2.tidyverse.org/reference/mpg.html.
Actually, we can simply access the help document in R as well with the following command.
?mpg
Question: How many samples do we have in the data set? How many variables?
mpg Data set
Now let’s have a complete view of the data set by using the command:
View(mpg)
Or you may use the following commands as well:
mpg
glimpse(mpg)
The good thing about glimpse(mpg) is that it will list all columns. If you only use mpg, some columns will be suppressed when there are too many of them.
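The size of the data set can also be queried directly instead of being read off the printed output. The sketch below uses the base functions nrow(), ncol(), dim(), and names(), which are not introduced in the text; it loads only ggplot2, the tidyverse package that provides mpg.

```r
library(ggplot2)   # mpg ships with ggplot2, which library(tidyverse) also loads

nrow(mpg)    # number of rows (samples)
ncol(mpg)    # number of columns (variables)
dim(mpg)     # both at once: rows first, then columns
names(mpg)   # the column names
```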
Next, we won’t learn any new commands. Instead, I will show you what we will learn to do in this course, using mpg as an example.
Please don’t worry about how the following figures are generated; we will learn that starting in the next class.
Data exploration refers to the practice of finding useful patterns or trends in a given data set, and plotting graphs to visualize data is one of the most important ways to explore data.
For example, given that cty is the miles per gallon in the city and hwy is the miles per gallon on the highway, what does the following plot indicate? Why does it look like an increasing line?
Now let’s explore the effect of drive train type (f = front-wheel drive, r = rear-wheel drive, 4 = 4-wheel drive) on fuel efficiency. What conclusion can you draw from the following graph? Which type is the most fuel-efficient?
Now let’s explore the effect of number of cylinders on fuel efficiency. What conclusion can you draw from the following graph?