Chapter -2: Articles to Read (Optional)

-1. Intro to R: https://bookdown.org/dli/rguide/pie-chart.html#basic-r-pie-chart 0. Applied predictive modeling: https://vuquangnguyen2016.files.wordpress.com/2018/03/applied-predictive-modeling-max-kuhn-kjell-johnson_1518.pdf

  1. Raskin, R.; Terry, H. (1988). “A principal-components analysis of the Narcissistic Personality Inventory and further evidence of its construct validity”. Journal of Personality and Social Psychology, Vol 54(5), 890-902. http://www.columbia.edu/~da358/npi16/raskin.pdf

A related data collection website is here: https://openpsychometrics.org/tests/NPI/

For other data for online personality tests, refer to https://openpsychometrics.org/_rawdata/. Publications based on these data are: https://openpsychometrics.org/_rawdata/cited/

  1. Michal Kosinski, David Stillwell, and Thore Graepel (2013). “Private traits and attributes are predictable from digital records of human behavior”. PNAS, Vol 110(15) 5802-5805. https://www.pnas.org/doi/epdf/10.1073/pnas.1218772110

  2. Lyle McKinney and Heather Novak (2016). “The Relationship Between FAFSA Filing and Persistence Among First-Year Community College Students”. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.960.6889&rep=rep1&type=pdf

  3. Edward P. St. John and Choong-Geun Chung. “Student Aid and Major Choice: A Study of High-Achieving Students of Color”. https://docs.gatesfoundation.org/documents/gms_and_major_choiceedstjohn.pdf

  4. NLP based on reviews of restaurant: https://medium.com/broadhorizon-cmotions/nlp-with-r-part-0-preparing-review-data-for-nlp-and-predictive-modeling-c1f2907d8312 or on Amazon books: https://www.researchgate.net/publication/323503127_Classification_of_Amazon_Book_Reviews_Based_on_Sentiment_Analysis

  5. Nice data visulization: https://www.thelancet.com/action/showPdf?pii=S0140-6736%2822%2900327-0

  6. Waterfall charts for visualizing feature contributions in regression and classification models: https://www.r-bloggers.com/2019/01/using-waterfall-charts-to-visualize-feature-contributions/

  7. Application of logistic regression to recidivism data: https://www.ussc.gov/sites/default/files/pdf/research-and-publications/research-publications/2020/20200429_Recidivism-SentLength.pdf

Chapter -1: Review of Some Mathematical/Statistical Concepts and Methods

The Distribution of a Final Population

What is the distribution of each of the following population?

  1. A city has 1000 people, 55% are female and 45% are male.

  2. The population of words in the following paragraph:

    “Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in
    Liberty, and dedicated to the proposition that all men are created equal.

    Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so
    dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

    But, in a larger sense, we can not dedicate – we can not consecrate – we can not hallow – this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or
    detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us – that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion – that we here highly resolve that these dead shall not have died in vain – that this nation, under God, shall have a new birth of freedom – and that government of the people, by the people, for the people, shall not perish from the earth.”

The Total Variation of Data

Given a group of numerical values: \(x_1, x_2, \cdots, x_n\), the total variation is defined as

\[\sum_{i=1} ^n (x_i - \bar{x})^2\] It is also called the total (error or within-group) sum of squares.

Exercise:

Find the total variation of the data: 2, 5, 0, 5, and 8.

The Variance of a Finite Population

The variance of a finite population is defined as the total variation divided by the size of the population. That is, if the values of a population are \(x_1, x_2, \cdots, x_n\), then, the variance is defined as

\[\frac{1}{n}\sum_{i=1} ^n (x_i - \bar{x})^2\] When applied to data, replace \(\frac{1}{n}\) by \(\frac{1}{n-1}\).

Exercise:

Find the total variation of the data: 2, 5, 0, 5, and 8.

The Total Within-Group Sum of Squares and the Between-Group Sum of Squares

When data are divided into different groups, the sum of the total variation for each group is called the total within-group sum of squares and we call

\[\sum_{k=1} ^c n_k\cdot (\bar{x}_k-\bar{x})^2\] the between-group sum of squares, where \(c\) is the number of groups, \(n_k\) is the size of the \(k\)-th group, \(\bar{x}_k\) is the \(k\)-th group mean, and \(\bar{x}\) is the grand mean, mean of all data.

It turns out that the total variation of all data is the sum of the total within-group sum of squares and the between-group sum of squares.

Exercise:

Divide the data {2, 3, 1, 0, 2, 4, 5, 1, 4} into 3 groups {2, 3, 1}, {0, 2}, and {4, 5, 1, 4}. Find

  1. the total variation of all data,

  2. the total within-group sum of squares, and

  3. the between-group sum of squares.

Check whether the total variation of all data is the sum of the total within-group sum of squares and the between-group sum of squares.

The Sum of Squared Residuals/Errors (SSE)

If the observed values of a response variable are \(y_1, y_2, \cdots, y_n\), and based on a set of explanatory variables (or called features, predictors), the fitted (or predicted) values are \(\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_n\), then, the sum of squared errors (SSE) is defined as

\[\sum_{i=1} ^n (y_i - \hat{y}_i)^2\] Exercise:

If observed y values are 10, 12, 15, 18, 25 and predicted values are 12, 11, 17, 15, 25, find the SSE.

The Impurity Measure for a Discrete Population: The Gini Index

If the proportions of different classes in a population (called A) are \(p_1, p_2, \cdots, p_m\), the Gini index of this population is defined as

\[Gini(A) = 1-\sum_{i = 1}^m p_i^2\] Exercise:

Find the Gini index of the population of registered voters with 20% democrats, 30% republicans, and 50% indepndents.

The Impurity Measure for a Discrete Population: The Entropy

If the proportions of different classes in a population (called A) are \(p_1, p_2, \cdots, p_m\), the entropy of this population is defined as

\[Gini(A) = -\sum_{i = 1}^m p_i log_2 (p_i)\] In information theory, entropy is used to describe the randomness of a distribution. The use of base 2 gives Gini the unit of bits. If the base 10 (\(e\)) is used, the unit is “dits” (“nats”).

Exercise:

  1. Find the entropy (in bits) of the population of registered voters with 20% democrats, 30% republicans, and 50% independents.

  2. Show that the entropy is no smaller than the Gini index for a finite population.

The Derivative of a One-Variable Function

Examples are:

  1. If \(f(x) = x^2\), then the derivative is \(f'(x) = 2x\).

  2. If \(f(x) = e^x\), then the derivative is \(f'(x) = e^x\).

  3. If \(f(x) = \frac{1}{1+e^{-x}}\), then the derivative is \(f'(x) = \frac{e^{-x}}{(1+e^{-x})^2}\). which can be written as \(f(x)[1-f(x)]\).

The Partial Derivative of a Multivarite Function

For a multivariate function, the partial derivative with respect to a variable is just the same as the derivative with respect to this variable, holding other variables fixed.

Exercise:

  1. Find the two partial derivatives of \(f(x, t) = \frac{1}{2}(x-t)^2\), find the two partial derivatives.

  2. If \(f(c, b) = \frac{1}{1+e^{-(cx+b)}}\), find the two partial derivatives.

The Euclidean Distance Between Two Points in a High-Dimensional Space

  1. In a two-dimensional space, find the distance between the two points: P(-3, 4) and Q(9, -1)$.

  2. In a 5-dimensional space, find the distance between the two points: P(2, 0, 0, 1, 5) and Q(-2, 0, 1, 1, -1)$.

  3. Given 5 points P(0, 1, 1), Q(-2, 0, 0), R(3, 3, -1), S(1, -5, 2), and T(6, 0, 2) in a 3-D space, what are the 3 closest neighbors of the point A(-1, 0, -2)?

Probability and Odds

Let’s denote the probability of an event \(A\) by \(P(A)\). Then, the odds of \(A\) happening is defined as \(\frac{P(A)}{1-P(A)}\). That is, the odds of an event equals the probability of the event divided by the probability that the event does not happen.

When throwing a fair die, the probability of getting heads, denoted \(P(H)\), equals 0.5. The odds of getting heads are then \(\frac{0.5}{1-0.5}=1\), sometimes written as the ratio 1:1.

When throwing a fair die, the probability of getting 6 points is 1 out of 6. The odds of getting 6 points are \(\frac{1/6}{1-1/6}=1/5\), or 1:5.

As the probability increases, the odds increase as well.

Two-way Tables with Variables of the Same Nature

When a model has been built based on a training data set to classify a binary response against a set of predictors, the model can be used to predict the same response variable based on a validation data set. The predicted values of the response can be used in conjunction with the actual response from the validation data to construct a two-way table. The table (also called a confusion matrix) can then be used to assess the performance of the model in terms of different measures or graphical displays.

Here is an example.

actual = c(1,1,1,1,1,1,1,1,0,0,0,0,0,0)  # We call "1" a positive and "0" a negative.
# The propensity score (i.e., probability for the class to be 1):
propensity = c(0.67, 0.93, 0.44, 0.83, 0.45, 0.77, 0.65, 0.81, 0.12, 0.04, 0.38, 0.22, 0.58, 0.45)

n = length(actual)

# If a cutoff of 0.5 is used, the predicted labels are
predicted = ifelse(propensity >= 0.5, 1, 0)
T = table(predicted, actual)
cat("The confusion matrix is:", T, "\n")
## The confusion matrix is: 5 1 2 6
# What are the number of true positives, false positives, true negatives, and false negatives?
tp = T[1, 1]
fp = T[1, 2]
fn = T[2, 1]
tn = T[2, 2]

# What percent of actual positives are predicted to be positive? This is the sensitivity or recall.
se = tp/(tp + fn)
cat("The sensitivity is:", se, "\n")
## The sensitivity is: 0.8333333
# What percent of actual negatives are predicted to be negative? This is the specificity.
sp = tn/(fp + tn)
cat("The specificity is:", se, "\n")
## The specificity is: 0.8333333
# What percent of observations are predicted to be positive? This is called the support.
su = (tp+fp)/n  # Here n is the total number of observations.
cat("The support is:", su, "\n")
## The support is: 0.5
# What percent of the original observations are positive? This is the base rate.
br = (tp + fn)/n  # This does not depend on prediction. It is determined by the original data.
cat("The base rate is:", br, "\n")
## The base rate is: 0.4285714
# When an arbitrary cutoff c is used, se, sp, and su are all depend on c.
# The plot of se versus su is called the gains curve.
# The plot of se/su versus su is called the lift curve.
# The plot of se versus fn is called the receiver operating characteristic (ROC) curve.

M = sapply(c(0, sort(propensity), 1), 
       function(c){
          pred = ifelse(propensity >= c, 1, 0)
          T = table(pred = factor(pred, levels = c(1,0)), 
                    actual = factor(actual, levels = c(1,0))
                   )
          tp = T[1, 1]
          fp = T[1, 2]
          fn = T[2, 1]
          tn = T[2, 2]
          se = tp/(tp + fn)
          fn = fp/(fp + tn)
          n = length(actual)
          su = (tp+fp)/n 
          c(c = c, se = se, fn = fn, su = su, lift = se/su)
       }
)

D = t(M)

plot(se~su, data = D, type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Support",
     ylab = "Sensitivity",
     main = "The Gains Curve") + segments(0,0,1,1,lty = "dashed")

## integer(0)
plot(lift~su, data = D, type = "l", xlim = c(0, 1), 
     main = "The Lift Curve",
     xlab = "Support",
     ylab = "Lift"
     )

plot(se~fn, data = D, type = "l", xlim = c(0, 1), ylim = c(0, 1), 
     main = "The ROC",
     xlab = "1-Specificity",
     ylab = "Sensitivity") + segments(0,0,1,1,lty = "dashed")

## integer(0)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc = roc(actual~propensity)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc, main = "ROC Curve")

Creating Dummy Variables

In many situations such as when fitting a neural network model, categorical predictor variables need to be dummified. An R function that does this is model.matrix(). A categorical variable with \(k\) categories will have \(k\) dummy variables created. When fitting a model involving combinations of variables, only \(k-1\) of these dummy variables are used and it is up to the user to decide which one to remove.

# In the following, "0" means to remove the constant column which has all values of 1
# "." means to use all variables.
Dummified.df = model.matrix(object = ~ 0 + .,  data = iris) %>% as.data.frame()

head(Dummified.df) 
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa
## 1          5.1         3.5          1.4         0.2             1
## 2          4.9         3.0          1.4         0.2             1
## 3          4.7         3.2          1.3         0.2             1
## 4          4.6         3.1          1.5         0.2             1
## 5          5.0         3.6          1.4         0.2             1
## 6          5.4         3.9          1.7         0.4             1
##   Speciesversicolor Speciesvirginica
## 1                 0                0
## 2                 0                0
## 3                 0                0
## 4                 0                0
## 5                 0                0
## 6                 0                0

Data Partition

For predictive analytics, data need to be partitioned into two or three data sets for different purposes.

partitionData = function(df, prop = c(0.60, 0.20, 0.20)){
    n = nrow(df)
    idx = sample(1:n, n)
    n1 = round(n*prop[1])
    n2 = round(n*prop[2])
    n3 = n - n1 - n2
    
    train.idx = idx[1:n1]
    valid.idx = idx[(n1+1):(n1+n2)]
    test.idx = idx[-c(1:(n1+n2))]
    
    train = df[train.idx, ]
    valid = df[valid.idx, ]
    test = df[test.idx, ]
    if (length(prop) == 2){
      list(train = train, valid = valid)
    } else{
      list(train = train, valid = valid, test = test)
    }
    
}

Chapter 0: Intro to R

What is R Markdown

To create an R Markdown file and produce an output from it, follow the steps:

  1. Open RStudio.

  2. From upper-left corner, Choose R Markdown (or in cloud, File -> New File -> R Markdown) -> Click Document -> Choose a Title -> Click OK. Now, you should see a template created.

  3. Edit the template to meet your needs. Open the “Knit” dropdown and choose “Knit to HTML”. Save the file. A good practice is to create a folder and place the file in it. Now, you should see a page pop up. You can click the eye dropdown to publish the output page by choosing Rpubs.

The current document is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

x=2
y=3

# R can do arithmetic
h = 2*x^2 - 3*x +1    # Use the asterisk (*) for multiplication and "^" means "to the power of"
z=2*x - y^(1/3) + 3*(x-y) - log(100)  # The carat (^) is for exponentiation with 1/3 the exponent
                                      # The "log" is called an R function
                                      # A function takes one or more inputs and return something as an output

z
## [1] -5.04742

Note: In R, any statement starting with # is a comment!

For those who have taken CNA267 (python programming), you can insert a code chunk for running python:

import math as m
x=2
y=3
z=2*x-y/3+(x-y)**3 + m.log(100) # R can do arithmetic

z
## 6.605170185988092

Python will not be used in this course! Just for a comparison.

You can also embed plots, for example:

plot(2,5) # Plot the point with coordibates (2, 5)

Data Types in R

R has some data types, including numeric, integer, character (or string), logical, and complex. To find the type of an object x, use the class() function as class(x).

x <- 3.4  # A numeric value. "<-" means to assign 3.4 to the variable x
3.4 -> x  # The same
x = 3.4   # Still the same. I will use "="

y = 100L    # An integer (followed n=by L), not often used in statistics

z = "Tom"   # A character

W <- 3>5  # Here "3>5" is called a logical expression, which has a value of "FALSE"

G = 2^3 < 3^2 # G takes the logical value of "TRUE", since 8 is less than 9

r = 3+2i # A complex number with real part 3 and imaginary part 2. Not useful in statistics
         # A complex number a+bi is equivalent to a point x-y plane with coordinates (a, b)


# Some special numeric values
a = Inf  # Infinity. Inf occurs when dividing by 0
b = NaN  # NaN = Not a Number. NaN occurs when taking square root on a negative number

class(a)
## [1] "numeric"
class(b)
## [1] "numeric"
is.numeric(a)
## [1] TRUE
is.numeric(b)
## [1] TRUE
# A special logical value
c = NA  # NA represents a missing value
is.logical(c)
## [1] TRUE

Data Structures in R

R has some data structures, including vector, matrix, arrays, data frame, list, and factor.

x=1:8 # A patterned numeric vector
y = c(1, 2, 3, 4, 5, 6, 7, 8) # Same vector
length(y) # Count the number of elements in vector y
## [1] 8
# We can change a value of a vector, if we want.
y[4] = 400 # Reset the 4th value of the y vector to 400. The number 4 is the index.
y # You can see the 4th value has been successfully changed.
## [1]   1   2   3 400   5   6   7   8
z = as.numeric(10) # Create a numeric vector of 10 zeros.
z
## [1] 10
z = c("Republican", "Democratic", "Independent") # A character vector (vector of strings)

M = matrix(y, 2,4) # A 2 by 4 matrix (2-rows and 4-column 2-dimensional array) based on the y vector

df = data.frame(x=1:7, y=c(2,4,5,1,0,3,2)) # A data frame of 2 columns (x and y). 
                                           # Here x and y are different from previous 2! 
                                           # They are local to the data frame.
df  # Display the data frame 
##   x y
## 1 1 2
## 2 2 4
## 3 3 5
## 4 4 1
## 5 5 0
## 6 6 3
## 7 7 2
dim(df) # the dimension (number of rows and number of columns) of the data frame
## [1] 7 2
names(df) # variable (or column) names
## [1] "x" "y"
df[ ,2]   # Extract a column by index (the second column extracted)
## [1] 2 4 5 1 0 3 2
df$y    # Extract a column by name
## [1] 2 4 5 1 0 3 2
df$w = df$y^2 # Add a new column (square of the y column) to the data frame "df"
df  # Now the df has 3 columns
##   x y  w
## 1 1 2  4
## 2 2 4 16
## 3 3 5 25
## 4 4 1  1
## 5 5 0  0
## 6 6 3  9
## 7 7 2  4
names(df) # display the variable (column) names as a character vector
## [1] "x" "y" "w"
names(df)[1] = "ID"
names(df) # The first column name of df has been changed successfully.
## [1] "ID" "y"  "w"
L = list(x=x, y=y, M=M, df=df) # Create a list based on existing objects. 
                               # The list has 4 elements: x, y, M, df
                               # Here x = x means that the first x is the element 
                               # and the second x is the value of it. The second x already exists.

L$M # Extract the M element in the list
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2  400    6    8
L[[3]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2  400    6    8
r = c(1,2,2,3,1,0,1,0,2, 4, 3, 2) # A numeric vector or 1-D array
table(r)  # Creates a frequency distribution for the data
## r
## 0 1 2 3 4 
## 2 3 4 2 1
# Convert the vector r to an un-ordered factor
r1 = factor(r)
r1
##  [1] 1 2 2 3 1 0 1 0 2 4 3 2
## Levels: 0 1 2 3 4
# Convert the vector r to an ordered factor
r1 = ordered(r)
r1
##  [1] 1 2 2 3 1 0 1 0 2 4 3 2
## Levels: 0 < 1 < 2 < 3 < 4
r2 = ordered(r, levels = c(4,3,2,1,0))
r2
##  [1] 1 2 2 3 1 0 1 0 2 4 3 2
## Levels: 4 < 3 < 2 < 1 < 0
barplot(table(r1))

barplot(table(r2)) # Order of the bars changes

r = c(1,2,2,3,1,0,1,0,2, 4, 3, 2) # A numeric vector or 1-D array

# The following converts r to a factor with meaningful labels
R = factor(r, 
           levels = 0:4,    # Levels are related to original distinct values in the r vector
           labels  = c("Not At All Satisfied", 
                       "Slightly Satisfied", 
                       "Neutral", 
                       "Very Satisfied", 
                       "Extremely Satisfied")   # The labels are arbitrary
          )

barplot(table(R)) # To make a barplot, the raw data should be tabulated first.

par(mar=c(5+4,4,4,2)) # Reset margins so that the labels of the following barplot are fully displayed
                      # The default margins are 5 (bottom), 4 (left), 4 (top), and 2(right).
                      # I make the bottom margin larger so the labels are seen.
barplot(table(R), las = 2) # More meaningful bar plot. las = 2 will make the labels displayed vertically

Built-in Functions in R

A function allows a user to use the same code repeatedly. We have used quit a few built-in functions in R. Here are some more.

x = c(23, 34, 56, 12, 20, 31, 42, 55, 22, 38, 19, 53, 28, 61, 92, 46, 24, 56, 10, 83, 72, 61, 82, 27, 53, 36)

is.numeric(x) # Check whether x is a numeric variable
## [1] TRUE
y = c(34, 31, 44, 52, 37, NA, 62, 70, NA, 77) # In R, "NA" represents a missing value
is.na(y)  # A vector of logical values
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
# The following are statistical summary (aggregate) functions
min(x) # Minimum of data vector x
## [1] 10
max(x) # Maximum
## [1] 92
mean(x) # Calculate the mean of data vector x
## [1] 43.69231
var(x) # Variance
## [1] 513.1015
sd(x) # standard deviation
## [1] 22.65174
summary(x) # A summary containing minimum, maximum, low and upper quartiles, median, and mean
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   24.75   40.00   43.69   56.00   92.00
quantile(x, probs = 0.95) # Calculate the 95th percentile
##   95% 
## 82.75
# To find the summaries of a numeric vector containing missing values, we need to remove NA's 
mean(y) # This will produce an NA value, since y contains NA's
## [1] NA
mean(y, na.rm = TRUE)  # Must specify the option "na.rm = TRUE" so that NAs are discounted
## [1] 50.875
sd(y, na.rm = TRUE)
## [1] 17.2911
# The following are functions for statistical graphics 
hist(x, xlab = "Grades", main = "The histogram of Scores", col = "blue") # Histogram

boxplot(x, xlab = "Grades", main = "The histogram of Scores", col = "blue")  # Boxplot

# Some math functions
sqrt(x) # Square roots of all elements
##  [1] 4.795832 5.830952 7.483315 3.464102 4.472136 5.567764 6.480741 7.416198
##  [9] 4.690416 6.164414 4.358899 7.280110 5.291503 7.810250 9.591663 6.782330
## [17] 4.898979 7.483315 3.162278 9.110434 8.485281 7.810250 9.055385 5.196152
## [25] 7.280110 6.000000
log(x) # Log of x
##  [1] 3.135494 3.526361 4.025352 2.484907 2.995732 3.433987 3.737670 4.007333
##  [9] 3.091042 3.637586 2.944439 3.970292 3.332205 4.110874 4.521789 3.828641
## [17] 3.178054 4.025352 2.302585 4.418841 4.276666 4.110874 4.406719 3.295837
## [25] 3.970292 3.583519
abs(x) # Absolute values
##  [1] 23 34 56 12 20 31 42 55 22 38 19 53 28 61 92 46 24 56 10 83 72 61 82 27 53
## [26] 36
round(x, 2) # Round each value in x to 2 decimal places
##  [1] 23 34 56 12 20 31 42 55 22 38 19 53 28 61 92 46 24 56 10 83 72 61 82 27 53
## [26] 36
# The following shows a better print function called cat()
Name = "Tom"
Age = 18

cat(Name, "is", Age, "years old.")
## Tom is 18 years old.
cat(Name, "is", Age, "years old.\n") # Add a new line symbol (\n) at the end.
## Tom is 18 years old.
paste(c("A", "B", "C"), c(100, 200, 300), sep = ":") # Paste the two vectors using a colon as a separator/delimiter.
## [1] "A:100" "B:200" "C:300"
paste("Year", c(2019, 2020, 2021), sep = "") # Paste "Year" to each of the 3 years
## [1] "Year2019" "Year2020" "Year2021"
# Gluing things together can be useful when creating labels for a plot.

x = c(56, 78, 89, 92, 75, 66)
ifelse(x<80, "Study hard!", "Good Job!") # If the condition is met, do the first part; O.W., do second part.
## [1] "Study hard!" "Study hard!" "Good Job!"   "Good Job!"   "Study hard!"
## [6] "Study hard!"
# A more general function than "ifelse" is the "case_when" function from package "dplyr"
x = c(56, 78, 89, 92, 75, 66)

library(dplyr)

case_when(
  x < 60 ~ "Fail",
  x < 80 ~ "Good",
  x < 90 ~ "Very Good",
  TRUE   ~ "Excellent!"
)
## [1] "Fail"       "Good"       "Very Good"  "Excellent!" "Good"      
## [6] "Good"
# Identify missing values
y = c(23, 45, NA, 15, NA, 90, log(-19))
## Warning in log(-19): NaNs produced
is.na(y) # Check whether each element in y is missing (NA)
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE
# Extract a part of a string
substring("The book is worth $180.", 5, 8) # Extract a part of the text from 5th char to 8th char.
## [1] "book"
substring("The book is worth $180.", 19) # If ending is not specified, it is the end of the text.
## [1] "$180."

Some built-in functions that process strings using regular expressions:

myString = "We will eventually master the R package 'ggplot2'! Don't you believe? Knowing how to use ggplot2 might boost your earnings by $20000 in year 2022, even $25000 in year 2003."

# Split string with the strsplit() function to get individual words or numbers
strsplit(myString, "[^a-zA-Z0-9']+")  # Match, in myString, characters that are NOT letters or digits
## [[1]]
##  [1] "We"         "will"       "eventually" "master"     "the"       
##  [6] "R"          "package"    "'ggplot2'"  "Don't"      "you"       
## [11] "believe"    "Knowing"    "how"        "to"         "use"       
## [16] "ggplot2"    "might"      "boost"      "your"       "earnings"  
## [21] "by"         "20000"      "in"         "year"       "2022"      
## [26] "even"       "25000"      "in"         "year"       "2003"
                                  # Once a match is found, split the myString at the match point.
                                  # The result is a list with only one element.

strsplit(myString, "[^a-zA-Z0-9']+")[[1]]   # Extract the only element in the list
##  [1] "We"         "will"       "eventually" "master"     "the"       
##  [6] "R"          "package"    "'ggplot2'"  "Don't"      "you"       
## [11] "believe"    "Knowing"    "how"        "to"         "use"       
## [16] "ggplot2"    "might"      "boost"      "your"       "earnings"  
## [21] "by"         "20000"      "in"         "year"       "2022"      
## [26] "even"       "25000"      "in"         "year"       "2003"
unlist(strsplit(myString, "[^a-zA-Z0-9']+")) # Or use the unlist() function
##  [1] "We"         "will"       "eventually" "master"     "the"       
##  [6] "R"          "package"    "'ggplot2'"  "Don't"      "you"       
## [11] "believe"    "Knowing"    "how"        "to"         "use"       
## [16] "ggplot2"    "might"      "boost"      "your"       "earnings"  
## [21] "by"         "20000"      "in"         "year"       "2022"      
## [26] "even"       "25000"      "in"         "year"       "2003"
# Extract all dollar amounts in a string with the str_extract_all() function
library(stringr)
str_extract_all(myString, "(?<=\\$)\\d+")          # Here "\\$" matches a $ sign
## [[1]]
## [1] "20000" "25000"
                                               # "\\d+" matches one or more digits 
                                               # "?<=" matches a $ preceding a number

For learning more about regular expressions in R, refer to https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html. ## User-Defined Functions in R

Users can define their own functions. I will give two examples. The first example is a very easy one, which takes two numbers as input and outputs/returns a number.

# Define the function
f = function(x, y){  
  return(x*y-x^2)
}

Here,

  • “f” is the function name,
  • “function” is a reserved word,
  • x and y are inputs/parameters.

Let’s use the defined function:

# Call the function
f(2,5) 
## [1] 6

The second example defines a function that takes a vector as input and returns a list of summary statistics. This is shown as below:

mySummary = function(x){
  mi = min(x)
  ma = max(x)
  lower = as.numeric(quantile(x, probs = 0.25))
  upper = as.numeric(quantile(x, probs = 0.75))
  me = median(x)
  m = mean(x)
  s = sd(x)
  
  L = list(min = mi, max = ma, Q1 = lower, median = me, Q3 = upper, mean = m, 'standard deviation' = s)
  
  return(L)
  
}

Let’s use the defined function:

x = c(23, 45, 23, 14, 92)

mySummary(x)
## $min
## [1] 14
## 
## $max
## [1] 92
## 
## $Q1
## [1] 23
## 
## $median
## [1] 23
## 
## $Q3
## [1] 45
## 
## $mean
## [1] 39.4
## 
## $`standard deviation`
## [1] 31.54838

If…else Logic

When dealing with case work, the use of an if-else structure is useful. Here is an example with two cases.

grade = 86

if (grade > 80){
  print("good")
} else {
  print("Study hard!")
}
## [1] "good"

Note that the above can be implemented using the built-in “ifelse” function, as shown below:

grade = 86

ifelse(grade > 80, "good", "Study hard!")
## [1] "good"
ifelse(grade > 90, "good", "Study hard!")
## [1] "Study hard!"

Here is an example with three cases:

grade = 86

if (grade > 80){
  print("Good job!")
} else if (grade > 60){
  print("You passed.")
} else {
  print("Study hard!")
}
## [1] "Good job!"

When there are three cases, a “if-else if-else” logic can be used. Similarly, when there are three cases, a “if-else if-else if-else” logic can be used.

For Loops

A “for loop” is used for iterating over a vector. It is used when the total number of iterations is known.

Example 1: Find the sum of all values in a numeric vector.

x = c(2, 0, 5, 1, 2, 4, 2)

Sum = 0 # Set the initial value (0) for the variable Sum

for (k in x){
  Sum = Sum + k  # Updating Sum: each time, the variable "Sum" increases by k
}

print(Sum)
## [1] 16

There is a built-in function that allows you to directly find the sum of a numeric vector:

sum(x)
## [1] 16

Example 2: Plot each variable in a data frame.

# Use the data frame "diamonds" from the ggplot2 package.
D = ggplot2::diamonds

# Loop through columns and create appropriate plots based on variable type
for (col in colnames(D)) {
  if (is.numeric(D[[col]])) {
    # Plot histograms for numeric columns
    hist(D[[col]], main = paste("Histogram of", col), xlab = col, col = "lightblue")
  } else if (is.factor(D[[col]])) {
    # Plot bar plots for factor columns
    barplot(table(D[[col]]), main = paste("Bar Plot of", col), col = "lightcoral")
  } else if (is.logical(D[[col]])) {
    # Plot bar plots for logical columns
    barplot(table(D[[col]]), main = paste("Bar Plot of", col), col = "lightgreen")
  } else {
    # Handle other types (e.g., character, date)
    print(paste("No plot available for", col, "with type", class(D[[col]])))
  }
}

While Loops (Optional)

When the total number of iterations is unknown, a for loop is not possible, but the while loop can handle this situation.

A simple example: simulate the process of flipping a coin until a head is seen.

# Function to simulate coin flips until a Head appears
flip_until_head <- function() {
  flips <- 0  # Counter to keep track of the number of flips
  result <- "T"  # Initialize result to be Tail

  # Repeat flips until a Head appears
  while(result != "H") {
    flips <- flips + 1  # Increment flip counter
    result <- sample(c("H", "T"), 1, replace = TRUE)  # Simulate coin flip
  }
  
  return(flips)  # Return the number of flips it took to get a Head
}

# Simulate the process of flipping a coin 1000 times until a head is seen,
# and create a frequency table of the number of flips required.
barplot(table(replicate(1000, flip_until_head())))

A more complicated function is given below. Skip it if it’s too difficult to understand:

###### The code checks if an input string is in balance for parentheses

inBalance = function(s){  # s is an input string
  
  s = unlist(strsplit(s, ""))          # Split the input string to individual chars and place the chars in the vector s.
  
  if (s[1] %in% c(")", "]", "}")){     # If the first char is an ending parenthesis, the string is not in balance.
    return(FALSE)                      # If a return statement is executed, no continuation is needed.
  }
  
  i = 2                                
  while (i <= length(s)){              # Since the first char is not an ending parenthesis, focus on other chars.
    # Check each pair of two consecutive chars. If pairing is found, update the vector of chars by removing the pair.
    if (((s[i-1] == "(") & (s[i] == ")"))|((s[i-1] == "[") & (s[i] == "]"))|((s[i-1] == "{") & (s[i] == "}"))){
      s = s[-c(i-1,i)]                  # update the vector of chars by removing the pair.
      if (s[1] %in% c(")", "]", "}")){  # Immediately check the first char of the updated vector to see whether it is an ending.
        return(FALSE)
      } 
      i = max(2, i -1)  # If i was 2, starting at place 2; otherwise, start from one place before, since you removed 2 places.
    } else {
      i = i + 1  # If no pairing at places i - 1 and i, move the check point one place ahead.
    }
  }
  
  if (length(s) == 0) {
    return(TRUE)
  }else {
    return(FALSE)
  }
}

##### Use the function
inBalance("()")  # TRUE
## [1] TRUE
inBalance("()[]") # TRUE
## [1] TRUE
inBalance("()()]") # FALSE
## [1] FALSE
inBalance("(()())") # TRUE
## [1] TRUE
inBalance("[()[({})]{}()[]]") # FALSE
## [1] TRUE
inBalance("()[({})({)]")      # TRUE
## [1] FALSE
inBalance("[(()[({})({})]){}]()[{}]")      # TRUE
## [1] TRUE

Importing Data into R

You can import data into R from a text file or a csv file using the read.table() function. You can use it as

     read.table("YourFile with extention", 
                header = FALSE or TRUE, 
                sep = " " or "," or "\t" or something else, 
                skip = 0 or a positive number, 
                stringsAsFactors = FALSE or TRUE)

The read.csv() function can be used only for a csv file. You can use it as

    read.csv("YourFile with or without extention", header = TRUE, sep = ",")
    

A fast and friendly file finagler is the fread() function from the package “data.table”. You can use it as

    fread("YourFile with extension",
          nrows = maximum number of rows to read, 
          skip = number of rows to skip)

In addition, you can use the read_excel() function in the package “readxl” to import a worksheet from an excel file (called a workbook). You can use it as

       read.xlsx("YourFile.xlsx", sheet = sheet number or "sheet name")
       

Each file that you import could exist remotely on another computer, so in this case, the file is a link. Each of the above functions has more parameters. You will need to check out the documentation of the function you use. To check out the documentation of a function, type: ?functionName on the console.

Using SQL Statements with Package “sqldf”

You can use the function sqldf() from the package “sqldf” to select data from a data frame. The simplest usage of the package is

  sqldf("select * from df")
  

which selects all rows from the data frame “df”; that is, it just give you the same data frame as the original one. Some example uses include (df refers to a data frame and col1 or col2 is a column)

  1. sqldf(“select * from df limit 6”)

  2. sqldf(“select * from df where …”)

  3. sqldf(“select groupVariable, avg(col2) from df having avg(col1) > aNumber)

A good reference for the use of sqldf package is https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/sql.html

You can add the option “verbose = TRUE” to each of the examples to find out what happens behind the screen. For instance:

    sqldf("select * from df", verbose = TRUE)
    

says that a temporary database was created and later destroyed.

df1 = sqldf("select * from mtcars limit 6") # Display the first 6 rows of mtcars
df2 = sqldf("select * from mtcars order by mpg limit 6") # mtcars ordered by mpg, displaying first 6 rows 
df3 = sqldf("select * from mtcars where mpg>25 and am = 1") # Select all automatic cars with more than 25 miles per gallon

# Create a data frame that displays mean mpg and mean displacement for each combination of carb and am
# The aggregate function avg() is used and a condition for selecting cars involves the aggregate function.
# So, instead of using the where clause, we use the having clause.
df4 = sqldf("select carb, am, avg(mpg) as 'AVG of mpg', avg(disp) as 'AVG of disp' from mtcars group by carb, am  having hp > 100")

df5 = sqldf("select carb, count(*) as Count from mtcars group by carb")

df1
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
df2
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## 2 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## 3 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## 4 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
## 5 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
## 6 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
df3
##    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 2 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 3 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 4 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 5 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 6 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
df4
##   carb am AVG of mpg AVG of disp
## 1    1  0   20.33333    201.0333
## 2    2  0   19.30000    278.2500
## 3    3  0   16.30000    275.8000
## 4    4  0   14.30000    345.3143
## 5    4  1   19.26667    223.6667
## 6    6  1   19.70000    145.0000
## 7    8  1   15.00000    301.0000
df5
##   carb Count
## 1    1     7
## 2    2    10
## 3    3     3
## 4    4    10
## 5    6     1
## 6    8     1

Graphics with R

A very good data visualization page is https://www.r-graph-gallery.com/. You can click each graph and see the code for making it using R tools, some being special packages.

Base Plots

We have introduced the hist(), boxplot() and barplot() functions for creating histograms, boxplots, and bar plots in R. To create a scatterplot, we use the plot() function.

x = c(0.5, 3, 0, 4, 5, 2, 3, 6, 1.5, 5)
y = c(78, 89, 67, 90, 93, 77, 85, 95, 75, 88)

D = data.frame(x, y)  # It's a good habit to make a data frame

plot(x, y, 
     xlab = "x",                             # The title of the x-axis
     main = "The Scatter Plot of y vs. x",   # The tile of the plot
     pch = 19,                               # The solid dot is chosen as the plotting character 
     col = "blue",                           # The color of each dot
     cex = 2                                 # The size of each dot
    )

# Fit a regression model
m = lm(formula = y~x,   # y: the response variable, x: the only explanatory variable here.
       data = D
      )

summary(m)
## 
## Call:
## lm(formula = y ~ x, data = D)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2479 -2.5305 -0.0767  1.9370  5.3000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71.2479     2.1355  33.363 7.11e-10 ***
## x             4.1507     0.6004   6.913 0.000123 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.628 on 8 degrees of freedom
## Multiple R-squared:  0.8566, Adjusted R-squared:  0.8387 
## F-statistic: 47.79 on 1 and 8 DF,  p-value: 0.0001229
abline(a = 71.2479,   # a: the intercept 
       b = 4.1507     # b: slope
      )

# You can use the following as well.
abline(m, col = "red", lty = "dashed")

Another example:

Year=1986:2007
T_bill_3Month = c(5.98, 5.78, 6.67, 8.11, 7.50, 5.38, 3.43, 3.33, 4.25, 5.49, 5.01, 5.06, 4.78, 4.64, 5.82, 3.40, 1.61, 1.01, 1.37, 3.16, 4.73, 4.48)
Inflation = c(1.10, 4.40, 4.40, 4.65, 6.11, 3.06, 2.90, 2.75, 2.67, 2.54, 3.32, 1.70, 1.61, 2.70, 3.40, 1.55, 2.49, 1.87, 3.26, 3.42, 2.54, 4.08)

par(mfrow = c(2,2)) # Split the screen into two rows and two columns
plot(T_bill_3Month~Year)
plot(Inflation~Year)
plot(T_bill_3Month~Inflation)

m= lm(T_bill_3Month~Inflation) # Fit a simple regression model
abline(m, lty = "dotted", col = "blue") # Add the regression line to the scatterplot

par(mfrow = c(1,1)) # Split the screen back to the default setting: 1 row and 2 columns

The following example shows how to create a custom barplot.

D=data.frame(year= c(2015:2018), Support=c(0.56, 0.41, 0.34, 0.67))
D$Disupport = 1-D$Support

bp =barplot(Support ~ year, 
            data = D, 
            horiz = TRUE, 
            xlim = c(-1,2), 
            space = 0, 
            col="green",
            xaxt = "n",   # Remove values on axis
            ann = FALSE   # Remove x label
           )  # The numbers (0.5, 1.5, 2.5, 3.5) on y axis are stored in bp

barplot(Support-1 ~ year, data = D, horiz = TRUE, add = TRUE, space=0, col = "pink",
        xaxt = "n", ann = FALSE)

text(D$Support, bp, labels = paste0(D$Support*100, "%"), pos = 4)


text(D$Support-1, bp, labels = paste0((1-D$Support)*100, "%"), pos = 2)

text(0, rev(as.numeric(bp))[1] + 0.6, labels = "Support", col = "green", pos = 4)

text(0, rev(as.numeric(bp))[1] + 0.6, labels = "Disupport", col = "pink", pos = 2)

In the above code,

  • “horiz = TRUE” means to show the bars horizontally.

  • pos = 1, 2, 3, or 4 indicates positions below, to the left of, above and to the right of the specified (x,y) coordinates.

Another more complicated example: are things in your country heading in the right direction, or are they off on the wrong track? Here is a reference: https://www.ipsos.com/sites/default/files/ct/news/documents/2020-07/www-july2020-global-summary-final.pdf

D=data.frame(Country = c("CountryA", "CountryB", "CountryC", "CountryD"), RunsInRightDirectionYear2020=c(0.56, 0.41, 0.34, 0.68), RunsInRightDirectionYear2021=c(0.45, 0.48, 0.54, 0.82))

par(mar = c(1, 5, 4, 1))
bp =barplot(RunsInRightDirectionYear2021 ~ Country, 
            data = D, 
            horiz = TRUE, 
            xlim = c(-1,3), 
            ylim = c(0, 5),  # Increase the y range so that texts added later are seen
            space = 0.1, 
            col="green",
            xaxt = "n",   # Remove values on axis
            ann = FALSE,   # Remove x label
            las = 2
           )

title("Are Things in Your Country Heading in the Right\nDirection, or Are They Off on the Wrong Track?", adj = 0, line = 0)

barplot(RunsInRightDirectionYear2021 - 1 ~ Country, 
        data = D, 
        horiz = TRUE, 
        xlim = c(-1,3), 
        ylim = c(0, 5),
        add = TRUE, 
        space=0.1, 
        col = "red",
        xaxt = "n",
        ann = FALSE,
        las = 2
       )

text(D$RunsInRightDirectionYear2021, 
     bp, 
     labels = paste0(D$RunsInRightDirectionYear2021*100, "%"), 
     pos = 4
    )

text(D$RunsInRightDirectionYear2021 - 1, 
     bp, 
     labels = paste0((1-D$RunsInRightDirectionYear2021)*100, "%"), 
     pos = 2
    )

text(0, rev(as.numeric(bp))[1] + 0.7, labels = "Right", col = "green", pos = 4)

text(0, rev(as.numeric(bp))[1] + 0.7, labels = "Wrong", col = "red", pos = 2)

text(2, rev(as.numeric(bp))[1] + 0.7, 
     labels = "% change compared \nto previous year", 
     col = "black"
    )
change = (D$RunsInRightDirectionYear2021-D$RunsInRightDirectionYear2020)*100
lbls = ifelse(change >= 0, 
                     eval(substitute(paste("↑",change))),               
                     eval(substitute(paste("↓",abs(change))))
             )
               

clrs = ifelse(change >= 0, "green", "red")  
a = 90
text(2, bp, labels = lbls, col = clrs
)

To create a pie chart, use the pie() function.

# Create a vector of values to be used in the pie chart
x <- c(10,20,30,40)

# Create a vector of labels, one for each sector
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")

# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")

# Display the pie chart
pie(x, label = paste0(mylabel, ": ", x), main = "Pie Chart", col = colors)

pie(mtcars$cyl)  # Not appropriate. Instead, do the following two lines:

T = table(mtcars$cyl) # Make a frequency table first
pie(T, label = paste0("Cylinder: ", names(T), " (", as.numeric(T), ")"))  # Must tabulate your data before creating a pie chart

# Let's also review the barplots
barplot(mtcars$cyl)  # Not appropriate

barplot(table(mtcars$cyl),    # Must tabulate your data before creating a pie chart
        main = "Distribution of Cars by Cylinder",   # Using a title is a good habit 
        col = 1:3   # just for demonstration
       )  

Plots for two categorical variables are side-by-side bar plots and mosaic plots.

barplot(table(mtcars$cyl, mtcars$carb), 
        beside = TRUE, 
        col = c("blue","red", "green"),
        ylab = "Number of cars"
       )
legend("topright", legend = c("Cyl = 4","Cyl = 6", "Cyl = 8"), 
       text.col = c("blue","red", "green"),
       bty="n")

barplot(table(mtcars$cyl, mtcars$carb), beside = FALSE, col = c(1,2,3))

mosaicplot(carb~cyl, data=mtcars) # Mosaic plot; can be used also for 3 variables

The ggplot2 Package

This package allows you to create different kinds of plots using a more generalized approach. It is based on the graphic concept that a graph is a combination of 3 independent components: data, aesthetics, and geometry, where

  • data is a data frame (maybe NULL),

  • aesthetics is used to set the x- and y-axes or to choose the shape, size, or color of the plotting character.

  • geometry refers to the type of graphics (such as scatter plot, line plot, histogram, boxplot, density curve, barplot, pie chart, violin plot).

We use examples to demonstrate the use of ggplot2 package when plotting data based on the following situations:

  • One continuous variable (histogram, boxplot, density curve)

  • One categorical variable (barplot, pie chart)

  • Two continuous variables (scatter plot, line plot)

  • One continuous variable and one categorical variable (side-by side boxplots, overlaid density curves, overlaid histograms, violin plots)

  • Two categorical variables (side-by-side barplots, mosaic plots)

  • More than 3 variables

head(iris) # Check the data
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
ggplot(iris) # Does not plot anything, since no x or y coordinate is specified

ggplot(iris, aes(x = Sepal.Length)) # Only plot the coordinate system

ggplot(iris, aes(x = Sepal.Length)) + 
  geom_histogram(bins = 6, 
                 color = "blue", 
                 fill = "red"
                ) # Add a layer

ggplot(iris, aes(x = Sepal.Length)) + geom_density(color = "blue") 

ggplot(iris, aes(x = Sepal.Length)) + geom_boxplot(color = "blue") 

ggplot(iris, aes(x = Species)) + geom_bar(color = "blue", fill = "red") 

ggplot(iris, aes(y = Species)) + geom_bar(color = "blue", fill = "red") # horizontal bars

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(col = "blue", size = 2)  # shape = pch in base plot

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_line(col = "blue", size = 2)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(mtcars, aes(x = factor(cyl), y = wt, fill = factor(cyl))) +
  geom_boxplot() + 
  theme(legend.position = "none") +
  labs(x = "Cylinder", y = "Weight", title = "My Plots", subtitle = "Made by XYZ", caption = "Data courtesy")

ggplot(mtcars, aes(x = wt, color = factor(cyl))) +
  geom_density() +
  labs(x = "Weight", color = "Cylinder")  # Legend's title is set via color here

ggplot(mtcars, aes(x = factor(cyl), y = wt, fill = factor(cyl))) +
  geom_violin() +
  labs(x = "Weight", fill = "Cylinder")  # Legend's title is set via color here

The Pipe

R sometimes use a special pipe operator for passing data from one function to another. It is based on the mathematics notation: f(g(h(x))) is equivalent to x -> h() -> g() -> f(). In R, we do: x %>% h() %>% g() %>% f().

x = c(12, 20, 18, 8, 24)

mean(x) 
## [1] 16.4
x %>% mean()
## [1] 16.4
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
mtcars %>% summary()
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000