Review of R basics


  • Basic algebraic operators in R: +, -, *, /, ^, %%, %/%

  • Basic assignment operators in R: ->, <-

  • Basic atomic data types in R: integer, double, character and logical

  • Basic logical values and operators: TRUE, FALSE, ==, !=, >=, <=, >, <, &&, ||, &, |

  • Create atomic vectors in R: c(), seq()

  • Help documentation: use ? or help()

  • Use functions in R: positional arguments and keyword arguments.
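
As a quick refresher, the operators listed above can be tried out as follows. This is only a small sketch; the variable names a and b are placeholders chosen for illustration.

```r
# Integer division and remainder
17 %/% 5   # quotient: 3
17 %% 5    # remainder: 2

# Both assignment arrows store a value; the arrow points at the name
a <- 10
20 -> b

# & and | compare vectors element by element
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)

# && and || work on single values and return a single TRUE or FALSE
(a < b) && (b > 15)

# seq() builds a regular sequence; c() combines individual values
evens_seq <- seq(2, 10, by = 2)
evens_c <- c(2, 4, 6, 8, 10)
```

Both evens_seq and evens_c hold the same five numbers, which shows that the same vector can usually be built in more than one way.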

Lab Exercise


  1. Compute \[ \left( \frac{2206 \sqrt{2}}{99^2} \right)^{-1} \]

  2. What is the result of the following code (don’t execute it, tell by inspection):

x <- 1
2 <- y
(x == y) | (x < y)
  3. Create a vector containing 1,3,5,7,9 in three different ways

  4. Use the help documentation to find the factorial function and the inverse sine function in R.

R Basics - Atomic Vectors


We have learned how to create vectors that contain multiple values of the same type with the c() function. These vectors are called atomic vectors, of which there are six types: logical, integer, double, character, complex, and raw. Integer and double vectors are collectively known as numeric vectors. In this course we won’t use complex and raw vectors, so we ignore them here.


We can use the typeof() function to check the type of a vector. Try the following code in your console.

typeof(c(1,2,3))
typeof(c(1L, 2L, 3L))
typeof(c("a", "b", "c"))
typeof(c(T, F, TRUE, FALSE))

Two Key Properties for Vectors


Every vector has two key properties:

  • Its type, which you can determine with typeof().
typeof(letters)
typeof(1:10)


  • Its length, which you can determine with length().
x <- seq(1, 20, by = 3)
length(x)


There’s one other related object: NULL. NULL is often used to represent the absence of a vector (as opposed to NA which is used to represent the absence of a value in a vector). NULL typically behaves like a vector of length 0.

length(NULL)
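
To make the difference between NA and NULL concrete, here is a small sketch. The vector x is a made-up example, not part of the course data.

```r
x <- c(5, NA, 7)

length(x)      # the NA still occupies a slot in the vector
is.na(x)       # marks which elements are missing

length(NA)     # NA is itself a vector of length 1
length(NULL)   # NULL has length 0
is.null(NULL)

# Combining with NULL adds nothing; combining with NA adds a missing element
with_null <- c(x, NULL)
with_na <- c(x, NA)
length(with_null)
length(with_na)
```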

Names for Vector Elements

In a vector, each element can be assigned a label, for example

grades <- c(James = 79, Jane = 85, Jack = 91)
grades
## James  Jane  Jack 
##    79    85    91

Here the values in grades are the numbers 79, 85, 91, but now each value is labeled with a student name. The labels of a vector form another vector, of type character. We can retrieve them with the function names():

names(grades)
## [1] "James" "Jane"  "Jack"

Its type is character.

student_names <- names(grades)
typeof(student_names)
## [1] "character"
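
Labels are useful for more than display. The sketch below shows two common uses: looking up a value by its label, and attaching labels after a vector has been created. The heights vector and its values are hypothetical and only serve as an illustration.

```r
grades <- c(James = 79, Jane = 85, Jack = 91)

# Look up a value by its label
grades["Jane"]

# Labels can also be attached after the vector is created
heights <- c(170, 165, 180)   # hypothetical values
names(heights) <- c("James", "Jane", "Jack")
heights

# unname() drops the labels and keeps the values
plain_grades <- unname(grades)
plain_grades
```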

Functions basics in R


To use a function, we need to understand the following basic concepts:

  • Argument: inputs of a function

  • Required Argument: If a required argument is missing, the function will print an error message

  • Optional Argument: One doesn’t have to input an optional argument; if it is missing, a default value will be taken.

  • Positional Argument: Arguments that have to be supplied in a specific order

  • Keyword Argument: Arguments that are specified by a keyword (usually optional with a default value)
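
These concepts can be illustrated with a small function of our own. Note that rounded_mean() is a hypothetical helper written only for this sketch; it is not a built-in R function.

```r
# 'x' is required (no default); 'digits' is optional (default 1)
rounded_mean <- function(x, digits = 1) {
  round(mean(x), digits)
}

v <- c(1, 2, 2)

rounded_mean(v)                  # positional x, default digits
rounded_mean(v, 2)               # both arguments by position
rounded_mean(digits = 2, x = v)  # by keyword, so the order does not matter

# Leaving out the required argument raises an error;
# tryCatch() captures the message so the script keeps running
msg <- tryCatch(rounded_mean(), error = function(e) conditionMessage(e))
msg
```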

Example

my_data <- c(1, 2, 2, 5, 10, NA)
mean(my_data, trim = 0, na.rm = TRUE)
## [1] 4

In the example above, we call the function mean() to compute the average of the data. Let’s check the help documentation of the function mean():

help("mean")

So we see that there are a few arguments for the function:

  • x - the input data: This is the first argument; it is required and has no default value.

  • trim - the fraction of observations to be trimmed from each end of x before the mean is computed: This is the second argument, with a default value of 0.

  • na.rm - a logical value (TRUE or FALSE) indicating whether NA values are removed before the mean is computed: This is the third argument, with a default value of FALSE.

Lab Exercise


Try the following code and see what it gives you. Can you explain why each result is what you observed?

mean(trim = 0, na.rm = T)
mean(my_data)
mean(my_data, trim = 0.2)
mean(my_data, trim = 0.2, na.rum = T)

R Basics - List


To create a “vector” mixed with different data types in R, we need to create a list using the list() function:

my_list <- list(student_name = c('James', 'Jane', 'Jack'), student_grade = c(79, 85, 91))


Here the first element in the list is a character vector, with the label student_name. The second element in the list is a double vector, with the label student_grade. To retrieve an element in a list, we usually use the $ operator along with the label.

my_list$student_name
## [1] "James" "Jane"  "Jack"
my_list$student_grade
## [1] 79 85 91
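
Besides $, a few other base R tools are handy for inspecting a list. The sketch below rebuilds my_list so that it runs on its own.

```r
my_list <- list(
  student_name  = c("James", "Jane", "Jack"),
  student_grade = c(79, 85, 91)
)

length(my_list)   # number of elements in the list
names(my_list)    # their labels
str(my_list)      # compact overview of types and contents

my_list[["student_grade"]]   # same vector as my_list$student_grade
my_list["student_grade"]     # still a list, holding that one element

# A value inside an element: pick the element, then index into it
my_list$student_grade[2]
```

The distinction between [[ ]] (returns the element itself) and [ ] (returns a shorter list) is easy to overlook, so it is worth trying both in the console.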

Data Frames


In this course, we will mostly handle rectangular data, or tabular data, which are simply data in spreadsheets with rows being samples and columns being features. Below is an example:

my_data <- data.frame(name = c('James', 'Alice', 'Lucy'), Math_grade = c(80, 90, 100), English_grade = c(100, 90, 80))
my_data
##    name Math_grade English_grade
## 1 James         80           100
## 2 Alice         90            90
## 3  Lucy        100            80


In this simple example, we have three students and their corresponding scores in two subjects. So each row is a sample (or observation), and each column is an attribute (usually called features or variables) for each sample.

Data Frames Are Special Lists


In R, data frames and tibbles (we will learn about these later) are built on top of lists. You can understand a data frame as a list whose elements are the columns: each column is a named vector, and all columns have the same length.


Again, we have the two basic functions length() and names() to retrieve this information for a data frame.

length(my_data)
## [1] 3
names(my_data)
## [1] "name"          "Math_grade"    "English_grade"

Note that here the length() function returns the number of columns (features). To get the number of rows, we need the function nrow()

nrow(my_data)
## [1] 3
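
The list nature of a data frame can be checked directly. The following sketch rebuilds my_data so that it is self-contained and uses only base R functions.

```r
my_data <- data.frame(
  name          = c("James", "Alice", "Lucy"),
  Math_grade    = c(80, 90, 100),
  English_grade = c(100, 90, 80)
)

is.list(my_data)          # a data frame is a list
my_data[["Math_grade"]]   # list-style access to one column
ncol(my_data)             # same number as length(my_data)
dim(my_data)              # number of rows, then number of columns
str(my_data)              # type of every column
```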

R Basics - Basic Operations of Data Frames


Lab Exercise - Basic Operations of Data Frames


  1. Try the following commands in the RStudio console. Before you execute the code, think about what you expect the result to be:
View(my_data)
my_data$name
my_data$Math_grade
my_data$English_grade
my_data[1, ]
my_data[2, ]
my_data[3, 3]
  2. Create a data frame with the first column being years from 1999 to 2002, and the second column being the day of the week (Monday, Tuesday, etc.) for January 1st of that year.

R Basics - install and load packages


In this course, we are going to analyze many data sets from the real world. Just as Python uses import, in R we need to load packages to access additional data sets and functions.


One of the most popular R package collections for data science is tidyverse. It includes packages such as ggplot2, dplyr, forcats, readr, tidyr, and stringr that handle different jobs in data science, including data importing, tidying, transformation, visualization and others.


For more details, please refer to https://www.tidyverse.org/. To install the package, simply run the following command:

install.packages("tidyverse")

You will not be able to use the functions, objects, and help files in a package until you load it with library(). Once you have installed a package, you can load it with the library() function:

library(tidyverse)


When the package loads, it prints a message telling you that tidyverse is attaching packages including ggplot2, tibble, tidyr, readr, purrr, and dplyr. These are considered to be the core of the tidyverse because you’ll use them in almost every analysis.


Packages in the tidyverse change fairly frequently. You can see if updates are available, and optionally install them, by running tidyverse_update().
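
Since install.packages() downloads the package every time it is called, it can be convenient to install only when the package is missing. The sketch below shows one way to do this; install_if_missing() is a hypothetical helper written for this illustration, and it assumes a CRAN mirror is configured for your R session.

```r
# Hypothetical helper (not part of any package):
# install 'pkg' only if it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available afterwards
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' ships with R, so nothing is downloaded here
install_if_missing("stats")

# For this course you would run the following once:
# install_if_missing("tidyverse")
```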

Access a data set from packages


After we load tidyverse, many data sets become accessible, and we can access each of them simply by its name. For example, today we will take a look at the data set named mpg.


mpg is a data set of fuel economy from 1999 to 2008 for 38 popular models of cars. More details about the data set can be found at https://ggplot2.tidyverse.org/reference/mpg.html.


Actually, we can simply access the help document in R as well with the following command.

?mpg
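
As the URL above suggests, mpg ships with the ggplot2 package, which is loaded as part of tidyverse. The following sketch shows how to list the data sets in a package and two ways to reach mpg; it loads only ggplot2 to keep the example small.

```r
library(ggplot2)

# Names of the data sets that ship with ggplot2
dataset_names <- data(package = "ggplot2")$results[, "Item"]
dataset_names

# Once the package is loaded, mpg can be used by its bare name
class(mpg)

# The package prefix works even when the package is not loaded
head(ggplot2::mpg, 3)
```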

The mpg Data set


Now let’s have a complete view of the data set by using the command:

View(mpg)


Or you may use the following commands as well

mpg
glimpse(mpg)

The good thing about glimpse(mpg) is that it lists all columns. If you print mpg directly and there are too many columns to fit on the screen, some of them will be omitted.

A glimpse of mpg


glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…


Lab Questions:


  1. How many columns (features/variables) are there? How many rows (samples) are there?

  2. What is the meaning of each variable?

  3. What is the average mpg in city for all car models in the data set?

Example of Data Visualization and Exploration


Next, we won’t learn any new commands. Instead, I will show you what we will learn to do in this course, using mpg as an example.


Please don’t worry about how the following figures are generated; we will learn that starting in the next class.


Data exploration refers to the practice of finding useful patterns or trends in a given data set, and plotting graphs to visualize data is one of the most important ways to explore data.

Example of Data Visualization and Exploration


For example, given that cty is the miles per gallon in the city and hwy is the miles per gallon on the highway, what does the following plot indicate? Why does it look like an increasing line?

Example of Data Visualization and Exploration


Now let’s explore the effect of drive train type (f = front-wheel drive, r = rear-wheel drive, 4 = 4-wheel drive) on fuel efficiency. What conclusion can you draw from the following graph? Which type is the most fuel efficient?

Example of Data Visualization and Exploration


Now let’s explore the effect of number of cylinders on fuel efficiency. What conclusion can you draw from the following graph?

Basic Concepts in Statistics


When we use graphs to summarize and visualize data, we are doing descriptive statistics. To understand what it is, we need to go through some basic concepts of statistics.

Let’s start our study by looking at a statement of a kind commonly seen in news media:

A poll of 1,247 voters shows that 48% of US voters support the Republican candidate, with a 3% margin of error.


Clearly, this is a statement based on statistics. Below are the key points for understanding this statement:

Population and Sample


Lab Exercise


Statement: “A study of survival of 1,225 newly diagnosed breast cancer cases finds that the average seven-year survival rate for Stage I breast cancer was 92%”.

Seven-year survival rate: the percentage of patients that survive seven years after diagnosis.

Descriptive vs Inferential Statistics



Lab Exercises


A study shows that 71.6% of US adults are overweight. Answer the following question:

Types of random variables


Lab Exercise


Give a real example of each of the following variable types:

  • categorical but not ordinal

  • ordinal

  • discrete numeric

  • continuous numeric


Introduction to data plots


Before we start to plot graphs, we need to review the basic types of data plots. There are many of them; below are a few examples, including some of the most commonly used ones:

Plot types depend on data types


Why do we have this many plot types? One reason is that we need different plots to best illustrate the relationship between (usually one or two) variables of different types.

  • Bar plots: (usually) for one categorical variable

  • Histograms: for one numeric variable

  • Box plots: for one continuous variable

  • Scatter plots: (usually) for two numeric variables

  • Multiple box plots: for one continuous variable and one categorical/discrete variable

  • Stacked bar plots: for two categorical variables

Example of a bar plot


Next, we will use the mpg data set to give an example of each plot type and explain its meaning.

A bar plot shows the distribution of one categorical variable.

ggplot(mpg) + 
  geom_bar(aes(x = drv))
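
The height of each bar is simply the number of rows in each category. As a rough cross-check, the same counts can be computed in base R; drv_counts is a name chosen for this sketch.

```r
library(ggplot2)   # provides the mpg data set

# Number of cars for each drive train type (the bar heights)
drv_counts <- table(mpg$drv)
drv_counts

# The same information as proportions of all cars
prop.table(drv_counts)
```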

Example of a histogram


A histogram shows the distribution of one numeric variable (discrete or continuous).

ggplot(mpg) + 
  geom_histogram(aes(x = hwy), binwidth = 5)
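
A histogram cuts the range of the variable into intervals (bins) and counts the observations in each one. The sketch below reproduces this idea in base R with bins of width 5. The bin boundaries here are our own choice and may not line up exactly with the ones ggplot2 picks, so the counts can differ slightly from the bar heights in the plot.

```r
library(ggplot2)   # provides the mpg data set

range(mpg$hwy)

# Bin boundaries: multiples of 5 that cover the whole range of hwy
lower <- floor(min(mpg$hwy) / 5) * 5
upper <- ceiling(max(mpg$hwy) / 5) * 5
breaks <- seq(lower, upper, by = 5)

# Assign every observation to a bin, then count per bin
bins <- cut(mpg$hwy, breaks = breaks, include.lowest = TRUE)
bin_counts <- table(bins)
bin_counts
```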

Example of a box plot


A box plot shows a five-number summary of a numeric variable (discrete or continuous).

ggplot(mpg) + 
  geom_boxplot(aes(x = hwy))
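
The numbers behind the box can be computed directly in base R. Note that in the plot the whiskers extend only to the most extreme points within 1.5 times the interquartile range of the box, with points beyond that drawn individually, so the whisker ends need not equal the minimum and maximum.

```r
library(ggplot2)   # provides the mpg data set

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(mpg$hwy)
five

summary(mpg$hwy)   # quartile-based summary plus the mean
IQR(mpg$hwy)       # interquartile range: the width of the box
```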

Example of a scatter plot


A scatter plot usually shows the relationship between two numeric variables.

ggplot(mpg) + 
  geom_point(aes(x = hwy, y = cty))
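
One simple way to put a number on what the scatter plot shows is the correlation coefficient, which is close to 1 when the points lie near an increasing line. This is only one possible summary, sketched here with base R.

```r
library(ggplot2)   # provides the mpg data set

# Correlation between highway and city fuel economy
r <- cor(mpg$hwy, mpg$cty)
r

# Fraction of cars whose highway mileage exceeds their city mileage
share_hwy_higher <- mean(mpg$hwy > mpg$cty)
share_hwy_higher
```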

Example of a multiple box plot


A multiple box plot usually shows the relationship between one categorical variable and one numeric variable.

ggplot(mpg) + 
  geom_boxplot(aes(x = drv, y = cty))
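
The thick line inside each box is the median of that group. The group medians can be computed in base R as a cross-check; two equivalent ways are sketched below.

```r
library(ggplot2)   # provides the mpg data set

# Median city mileage for each drive train type
cty_medians <- tapply(mpg$cty, mpg$drv, median)
cty_medians

# The same computation, returned as a data frame
aggregate(cty ~ drv, data = mpg, FUN = median)
```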

Example of stacked bar plot


A stacked bar plot usually shows the relationship between two categorical variables.

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = drv, fill = class))
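
Each colored segment in the stacked bar plot corresponds to one cell of a two-way table of counts. As a sketch, the table can be built in base R as follows.

```r
library(ggplot2)   # provides the mpg data set

# Rows: drive train type; columns: car class
counts <- table(mpg$drv, mpg$class)
counts

# Row totals give the full height of each bar
bar_heights <- rowSums(counts)
bar_heights
```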