knitr::opts_chunk$set(echo = TRUE)

library(ggplot2) # It's a good idea to load all your packages in this first code chunk
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(highcharter) 
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

About R

R is an interpreted language. When you enter expressions into the R console (or run an R script in batch mode), a program within the R system, called the interpreter, executes the actual code that you wrote. Unlike C, C++, and Java, there is no need to compile your programs into an object language. Other examples of interpreted languages are Common Lisp, Perl, and JavaScript. (R in a Nutshell, 2nd Edition, by Joseph Adler)

A good reference for the R programming language is https://www.tutorialspoint.com/r/index.htm.

Load R Packages

Code written in R often depends on packages. These packages must first be installed in one of two ways:

  • Install on the console by issuing

    install.packages("the package name in double or single quotes")

  • Or go to the lower-right pane of RStudio, click the “Packages” tab and then “Install”, type the name of the package you want to install, and click the “Install” button. The console will show the progress of the installation.

A package only needs to be installed once. To remove a package from your computer, go to the lower-right pane again, find the package in the list of packages, and click the “x” at the right edge of its row.

When your R code uses a function or a dataset from a particular package, you need to load the package by issuing

library("the package name with or without quotes")

print("Please install the 'igraph' package.")
## [1] "Please install the 'igraph' package."
library(igraph) # Load the package "igraph"
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
# A graph with directed edges: 1->2 2->4 3->1 2->1 3->2 4->1 4->3
g1 <- make_graph(edges = c(1,2, 2,4, 3,1, 2,1, 3,2, 4,1, 4,3), n = 4, directed = TRUE) # make_graph() replaces graph(), which is deprecated since igraph 2.1.0
plot(g1) # A plot of the network
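
The install-once, load-every-session pattern described above can be combined into one step. The following is a minimal sketch, not a standard R function: the helper name ensure_package is hypothetical, and it assumes a CRAN mirror is reachable whenever a package actually has to be installed.

```r
# Hypothetical helper (not part of base R): install a package only if it is
# missing, then load it for the current session.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # runs only when the package is not yet installed
  }
  library(pkg, character.only = TRUE)  # pkg is a string, hence character.only
  invisible(TRUE)
}

# "stats" ships with R, so this call loads it without installing anything.
ensure_package("stats")
```

Calling ensure_package("igraph") at the top of a script would then install igraph on the first run and only load it on later runs.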

Code Chunk

To run R code or other code within the RStudio environment, the code must be in a code chunk. An R code chunk looks like

# Your many lines of code are in between
# x=3
# y=4

A shortcut for adding a code chunk is to click the “Insert” button at the top of the upper-left pane and then choose the appropriate programming language.

Using Comments in Your Code

Comments help both you and your readers understand your code. A comment starts with one or more #’s; anything after # on the same line is treated as a comment. To comment out multiple lines of code in RStudio, highlight those lines and press Ctrl+Shift+C (Cmd+Shift+C on a Mac).
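
As a small illustration of the comment rules above, the sketch below mixes full-line comments, end-of-line comments, and a commented-out line of code. The variable names are made up for this example.

```r
# A full-line comment: compute the area of a circle
radius <- 2             # an end-of-line comment: the radius, in cm
area <- pi * radius^2   # pi is a constant built into R
# print(radius)         # a commented-out line of code is never run
print(area)
```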

Defining Basic R Objects

Objects are the instances of classes. Everything in R is an object of a certain class. Each class has a certain structure. Basic data structures in R include vectors, matrices, data frames, lists, and factors.

Vectors in R

Vectors are one-dimensional arrays whose elements all share one type, for example all numeric values or all character strings. If any element of a vector is a string, the other elements are automatically converted to strings.

x = 4 # This defines a scalar, which is a numeric vector of length 1
print(x) # This prints x. The name "print" can be omitted.
## [1] 4
y = c(2, 5, 9, 10) # This defines a numeric vector of length 4. The 4 elements are 2, 5, 9, and 10.
z = 1:10 # This defines a patterned numeric vector of length 10. The elements are 1, 2, ..., 10.
t = seq(3, 100, by = 10) # An arithmetic sequence (vector) with initial term 3 and increment 10.
a = y^2 # This defines a numeric vector whose elements are the squares of the elements of y.
b = log(y) # Natural log of each element of y, stored in b
u = "Hello World!" # This defines a character vector of length 1.
v = c("David", "Mike", "Rich") # This defines a character vector of length 3.
w = c("Haha", "Hehe", 5, 10) # The elements 5 and 10 are converted to strings automatically.
print(w)
## [1] "Haha" "Hehe" "5"    "10"
class(y) # This shows the class of the R object y.
## [1] "numeric"
class(w)
## [1] "character"
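
The automatic conversion described above follows a fixed order: logical values become numbers when mixed with numbers, and everything becomes a string when mixed with strings. A short sketch, with variable names chosen only for this illustration:

```r
a1 <- c(TRUE, FALSE, 3)     # logical values become numbers: 1, 0, 3
a2 <- c(1.5, "two", TRUE)   # every element becomes a character string

class(a1)
class(a2)
print(a2)

# Converting back to numbers: strings that do not look like numbers become NA.
# R also issues a warning here, which is suppressed to keep the output clean.
a3 <- suppressWarnings(as.numeric(c("5", "10", "Haha")))
print(a3)
```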

Matrices and Data Frames in R

Matrices are 2-dimensional arrays. A matrix can only hold elements that are either all numeric values or all characters, but a data frame can hold numeric values in some columns and characters in other columns.

M = matrix(1:20, nrow = 4, byrow = TRUE) # This defines a 4-by-5 matrix whose elements are 1 through 20, filled row by row.
M
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
## [4,]   16   17   18   19   20
dim(M) # This displays the dimension of matrix M.
## [1] 4 5
rownames(M) = c("Row1", "Row2", "Row3", "Row4")
colnames(M) = c("Col1", "Col2", "Col3", "Col4", "Col5")
dimnames(M) # This displays both row names and column names
## [[1]]
## [1] "Row1" "Row2" "Row3" "Row4"
## 
## [[2]]
## [1] "Col1" "Col2" "Col3" "Col4" "Col5"
M
##      Col1 Col2 Col3 Col4 Col5
## Row1    1    2    3    4    5
## Row2    6    7    8    9   10
## Row3   11   12   13   14   15
## Row4   16   17   18   19   20
D = data.frame(y, a, b, grade = c("A", "B", "B+", "A-")) # This defines a data frame.
dimnames(D) # This gives both row names and column names of the data frame D
## [[1]]
## [1] "1" "2" "3" "4"
## 
## [[2]]
## [1] "y"     "a"     "b"     "grade"
rownames(D) = c("Jenny", "Henny", "Bob", "Tod") # Change row names of D
colnames(D) = c("Y", "A", "B", "Grade") # Change column names. Equivalently, you can use the function "names".

D
##        Y   A         B Grade
## Jenny  2   4 0.6931472     A
## Henny  5  25 1.6094379     B
## Bob    9  81 2.1972246    B+
## Tod   10 100 2.3025851    A-
class(M)
## [1] "matrix" "array"
class(D)
## [1] "data.frame"
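
The difference between the two structures shows up as soon as numeric and character data are combined. In the sketch below (the vectors height and name are invented for this illustration), the matrix turns the numbers into strings, while the data frame keeps each column's own type.

```r
height <- c(1.6, 1.8, 1.7)
name <- c("Ann", "Ben", "Cy")

m <- cbind(height, name)        # a matrix: the numbers are coerced to strings
d <- data.frame(height, name)   # a data frame: each column keeps its own type

class(m[, "height"])   # the heights are now character strings
class(d$height)        # still numeric
dim(m)
```

Because the heights in m are strings, arithmetic such as mean(m[, "height"]) no longer works, whereas mean(d$height) does.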

Lists in R

Lists in R can hold elements of any kind, and the elements need not share a type. Lists matter in practice because model-fitting functions typically return their output as a list.

myList = list(A=1:5, B = matrix(1:8, nrow = 2, byrow = TRUE), C = "Hello!", D = data.frame(x=1:4, y = 9:6))
print(myList)
## $A
## [1] 1 2 3 4 5
## 
## $B
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## 
## $C
## [1] "Hello!"
## 
## $D
##   x y
## 1 1 9
## 2 2 8
## 3 3 7
## 4 4 6
class(myList)
## [1] "list"
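
To see why lists matter for model fitting, consider the object returned by lm(). The sketch below fits a simple linear regression on the built-in mtcars data; the choice of mpg ~ wt is only an illustration.

```r
fit <- lm(mpg ~ wt, data = mtcars)   # simple linear regression

is.list(fit)        # the fitted model is stored as a list
names(fit)          # the components of that list
fit$coefficients    # extract one component by name
fit[["residuals"]][1:3]   # another component, using double brackets
```

The subsetting operators used here are the same ones that apply to any other list.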

Factors in R

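To encode a vector as a factor with certain levels, use the R function “factor”. The levels list the values the vector is allowed to take; any value of the vector that is not among the levels becomes NA. Labels are optional. When a vector is converted to a factor, it becomes a categorical variable, which can be ordinal or nominal, depending on whether the argument “ordered = TRUE” is set (for an ordinary vector the default is FALSE). The abbreviation “order = TRUE” also works, because R matches argument names partially.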

v = c(0, 4, 4, 2, 3, 1, 0, 3, 4, 2, 0, 1)
# The following encodes the vector v as an ordered factor with levels 0 through 4.
x = factor(v, levels = c(0, 1, 2, 3, 4), labels = c("zero", "one", "two", "three", "four"), ordered = TRUE)

levels(x) # This displays the levels of the factor x
## [1] "zero"  "one"   "two"   "three" "four"
class(v)
## [1] "numeric"
class(x)
## [1] "ordered" "factor"
w = c("Premium","Premium","Premium","Ideal","Very Good","Very Good","Good","Ideal","Premium","Premium","Ideal","Very Good")
z1 = factor(w, levels = c("Good", "Very Good", "Premium", "Ideal"), labels = c("G", "VG", "P", "I"), ordered = TRUE)
z1
##  [1] P  P  P  I  VG VG G  I  P  P  I  VG
## Levels: G < VG < P < I
z2 = factor(w, levels = c("Good", "Very Good", "Premium", "Ideal"), ordered = FALSE)
z2
##  [1] Premium   Premium   Premium   Ideal     Very Good Very Good Good     
##  [8] Ideal     Premium   Premium   Ideal     Very Good
## Levels: Good Very Good Premium Ideal
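
The practical effect of an ordered factor is that its values can be compared. The following sketch uses a made-up vector sizes; note that "huge" is deliberately not among the levels, so it is encoded as NA.

```r
sizes <- c("low", "high", "medium", "high", "huge")
f <- factor(sizes, levels = c("low", "medium", "high"), ordered = TRUE)

f                         # "huge" is not a level, so it shows up as <NA>
table(f)                  # counts per level (NA is not counted by default)
below_high <- f < "high"  # comparisons are meaningful for ordered factors
below_high
codes <- as.integer(f)    # the integer codes stored behind the labels
codes
```

For an unordered factor the comparison f < "high" is not meaningful; R returns NA with a warning.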

Subsetting in R

You can extract part of a data structure with subsetting operations.

x = (1:8)/10
x[3] # Extract the third element from vector x
## [1] 0.3
x[2:5] # Extract 2nd to 5th elements as a vector
## [1] 0.2 0.3 0.4 0.5
M = matrix(1:35, nrow = 5, byrow = TRUE)

M[4] # Extract the 4th element of M, counting down the columns (so this is row 4 of column 1)
## [1] 22
M[4, ] # Extract the 4th row as a vector
## [1] 22 23 24 25 26 27 28
M[, 4] # Extract the 4th column as a vector
## [1]  4 11 18 25 32
M[2, 5] # Extract the element at the intersection of second row and 5th column.
## [1] 12
M[2:4, ] # Extract the second to 4th rows
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    8    9   10   11   12   13   14
## [2,]   15   16   17   18   19   20   21
## [3,]   22   23   24   25   26   27   28
M[, 2:4] # Extract the second to 4th columns
##      [,1] [,2] [,3]
## [1,]    2    3    4
## [2,]    9   10   11
## [3,]   16   17   18
## [4,]   23   24   25
## [5,]   30   31   32
y = data.frame(a=1:10, b = 5:14)

y[2] # Extract the second column and make it a new data frame (not useful).
##     b
## 1   5
## 2   6
## 3   7
## 4   8
## 5   9
## 6  10
## 7  11
## 8  12
## 9  13
## 10 14
y[[2]] # Extract the second column as a vector.
##  [1]  5  6  7  8  9 10 11 12 13 14
y[2, ] # Extract the second row as a one-row data frame.
##   a b
## 2 2 6
y[, 2] # Same as y[[2]]
##  [1]  5  6  7  8  9 10 11 12 13 14
y$b # Extract the "b" column as a vector. Dollar is good, but not necessary!
##  [1]  5  6  7  8  9 10 11 12 13 14
y$"b" # Same as y$b
##  [1]  5  6  7  8  9 10 11 12 13 14
y["b"] # Same as y[2], a new data frame (not useful).
##     b
## 1   5
## 2   6
## 3   7
## 4   8
## 5   9
## 6  10
## 7  11
## 8  12
## 9  13
## 10 14
y[,"b"] # Same as y[, 2]
##  [1]  5  6  7  8  9 10 11 12 13 14
y[["b"]] # Same as y[[2]], a vector (unlike y["b"], which is a data frame)
##  [1]  5  6  7  8  9 10 11 12 13 14
y[3:6, ] # Extract the 3rd to 6th rows as a new data frame
##   a  b
## 3 3  7
## 4 4  8
## 5 5  9
## 6 6 10
myList = list(A=1:5, B = matrix(1:8, nrow = 2, byrow = TRUE), C = "Hello!", D = data.frame(x=1:4, y = 9:6))
myList[4] # Extract the 4th element as a new list with only one element D.
## $D
##   x y
## 1 1 9
## 2 2 8
## 3 3 7
## 4 4 6
myList[[4]] # Not a list any more
##   x y
## 1 1 9
## 2 2 8
## 3 3 7
## 4 4 6
myList$D # Same as myList[[4]]
##   x y
## 1 1 9
## 2 2 8
## 3 3 7
## 4 4 6
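
A few further subsetting idioms build on the ones above: negative indices drop elements, logical vectors select elements, and drop = FALSE keeps the original structure when a single row or column is extracted. This is a brief sketch that recreates the objects x, M, and y defined earlier; the names on the left-hand sides are chosen only for this example.

```r
x <- (1:8) / 10
M <- matrix(1:35, nrow = 5, byrow = TRUE)
y <- data.frame(a = 1:10, b = 5:14)

x_rest <- x[-1]                  # everything except the first element
x_odd <- x[c(TRUE, FALSE)]       # logical index is recycled: 1st, 3rd, 5th, 7th
col4 <- M[, 4, drop = FALSE]     # 4th column, kept as a 5-by-1 matrix
big_a <- y[y$a > 8, ]            # rows of y for which column a exceeds 8
b_df <- y[, "b", drop = FALSE]   # column b, kept as a one-column data frame

dim(col4)
big_a
```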

Logical Expressions in R

Frequently, we need to compare two R objects to check whether they are equal or whether one is greater than the other. The result of such a comparison is a logical value, TRUE or FALSE.

x = 4
y = 5
z = (x > y)
print(z)
## [1] FALSE
w = (x <= y)
print(w)
## [1] TRUE
a = "abc"
b = "abC"
d = (a != b) # Is a not equal to b?
print(d)
## [1] TRUE
q = c(2, 9, 11, 45, 34, 8, 24, 15, 5, 7, 21)
r = 5
s = (q>r)
print(s)
##  [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
u = TRUE
v = "TRUE"
class(u)
## [1] "logical"
class(v)
## [1] "character"
D2 = mtcars[mtcars$cyl == 4, c(2, 7, 9)] # Rows of mtcars with cyl equal to 4; columns 2, 7, and 9 (cyl, qsec, am)
D2
##                cyl  qsec am
## Datsun 710       4 18.61  1
## Merc 240D        4 20.00  0
## Merc 230         4 22.90  0
## Fiat 128         4 19.47  1
## Honda Civic      4 18.52  1
## Toyota Corolla   4 19.90  1
## Toyota Corona    4 20.01  0
## Fiat X1-9        4 18.90  1
## Porsche 914-2    4 16.70  1
## Lotus Europa     4 16.90  1
## Volvo 142E       4 18.60  1
t = q[q>10] # a vector containing values that are greater than 10
print(t)
## [1] 11 45 34 24 15 21
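
Logical vectors can also be combined and summarized. The sketch below reuses the vector q from above; & means "and", | means "or", and both work element by element. The variable names other than q are chosen only for this example.

```r
q <- c(2, 9, 11, 45, 34, 8, 24, 15, 5, 7, 21)

in_range <- q > 5 & q < 20      # TRUE where both conditions hold
middle <- q[in_range]           # 9 11 8 15 7
extreme <- q[q < 5 | q > 40]    # 2 45
positions <- which(q > 30)      # positions of the TRUE values: 4 5
count <- sum(q > 10)            # TRUE counts as 1, so this counts matches: 6

any(q > 40)   # is at least one element greater than 40?
all(q > 1)    # are all elements greater than 1?
```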

Conditional Branching in R

x = 15

# 4 branches: The number line is divided into 4 intervals: 
# (-infinity, 10), [10, 20), [20, 30), and [30, infinity)
if (x < 10){
  y = 2*x - 3
} else if (x <20){
  y = 3*x + 4
} else if (x < 30){
  y = 5*x - 12
} else{
  y = 10000
}

print(y)
## [1] 49
# 2 branches
States = c("MN", "FL", "IL", "CA")
state = "IL"
if (state %in% States){
  message = "Found it!"
} else{
  message = "Not found."
}

print(message)
## [1] "Found it!"
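
An if statement tests a single condition. To apply the same four-branch rule to every element of a vector, one option is the vectorized function ifelse. A minimal sketch, using a made-up input vector with one value in each interval:

```r
x <- c(5, 15, 25, 35)   # one value from each of the four intervals

# Nested ifelse() calls mirror the if / else if / else chain above.
y <- ifelse(x < 10, 2 * x - 3,
     ifelse(x < 20, 3 * x + 4,
     ifelse(x < 30, 5 * x - 12, 10000)))

print(y)   # 7 49 113 10000
```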

Looping in R

A loop performs an operation repeatedly. Like many other programming languages, R has for loops and while loops.

## The following gives a way of calculating the sum of the first 100 natural numbers.
sum = 0 # Initial value is 0
for (k in 1:100){
  sum = sum + k
}

print(sum)
## [1] 5050
## The following gives another way of calculating the sum of the first 100 natural numbers.
sum = 0
k = 1
while (k <= 100){
  sum = sum + k
  k = k + 1
}

print(sum)
## [1] 5050
## Or, we simply call a function to do the job
sum(1:100)
## [1] 5050
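
Loops are also commonly used to fill a vector one element at a time. The sketch below computes the first 10 squares in three ways (a for loop over a preallocated vector, sapply, and plain vectorized arithmetic) to show that they agree. The variable names are chosen only for this illustration.

```r
n <- 10

# 1. A for loop that fills a preallocated vector
squares <- numeric(n)
for (k in seq_len(n)) {
  squares[k] <- k^2
}

# 2. sapply applies a function to each element and collects the results
squares_sapply <- sapply(1:n, function(k) k^2)

# 3. Arithmetic in R is vectorized, so no loop is needed at all
squares_vec <- (1:n)^2

print(squares)
total <- sum(squares)   # 385
```

In R the vectorized form is usually the shortest and the fastest of the three.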

Functions in R

A function in R is an object so the R interpreter is able to pass control to the function, along with arguments that may be necessary for the function to accomplish the actions. The function in turn performs its task and returns control to the interpreter as well as any result which may be stored in other objects. (From https://www.tutorialspoint.com/r/r_functions.htm)
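
As an illustration of a function being an object that receives arguments and returns a result, here is a minimal sketch of a user-defined function. The name sum_first_n is hypothetical; it wraps the summation loop from the previous section.

```r
# Hypothetical function: the sum of the first n natural numbers
sum_first_n <- function(n = 100) {
  total <- 0
  for (k in seq_len(n)) {
    total <- total + k
  }
  total   # the value of the last expression is returned
}

result <- sum_first_n(100)   # the result is stored in another object
print(result)                # 5050
class(sum_first_n)           # a function is itself an object
```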

Built-in Functions in R

Many built-in functions are available in R, and we have already used quite a few of them above. To see the details of a built-in function, type ?functionName in the R console.
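
Besides the help page, a function's arguments and the package an object comes from can be inspected directly from code. A short sketch, using the functions factor and the data set mtcars as examples:

```r
# ?factor or help("factor") opens the help page in an interactive session.

arg_names <- names(formals(factor))   # the argument names of factor()
print(arg_names)

where <- find("mtcars")   # which attached package provides mtcars?
print(where)              # "package:datasets"
```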

A few very useful built-in functions are demonstrated below.

# Create a vector of 10 zeros
x = numeric(10)
x
##  [1] 0 0 0 0 0 0 0 0 0 0
x[4] = 1000 # Set the 4th element of the numeric vector to 1000
x
##  [1]    0    0    0 1000    0    0    0    0    0    0
# Create a character vector of 10 empty strings
y = character(10)

print(mtcars) # "mtcars" is a data frame available from the "datasets" package, which is loaded by default
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
head(mtcars, n = 10) # Display only the first 10 rows of the data
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
nrow(mtcars) # Display the number of rows in the data
## [1] 32
names(mtcars) # Display the column names of data
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
colnames(mtcars) # column names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
rownames(mtcars) # row names
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"
dimnames(mtcars) # Both row and column names
## [[1]]
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"         
## 
## [[2]]
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
str(mtcars) # Display the structure of the mtcars data frame
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
class(mtcars) # The class of the data
## [1] "data.frame"
summary(mtcars) # Summarize each column of the mtcars data frame
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
plot(mtcars) # Scatterplot matrix

# Add a new column to a data frame
D = mtcars # Create a copy
D$log.mpg = log(D$mpg)
D$sq.wt = D$wt^2
D
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
##                      log.mpg     sq.wt
## Mazda RX4           3.044522  6.864400
## Mazda RX4 Wag       3.044522  8.265625
## Datsun 710          3.126761  5.382400
## Hornet 4 Drive      3.063391 10.336225
## Hornet Sportabout   2.928524 11.833600
## Valiant             2.895912 11.971600
## Duster 360          2.660260 12.744900
## Merc 240D           3.194583 10.176100
## Merc 230            3.126761  9.922500
## Merc 280            2.954910 11.833600
## Merc 280C           2.879198 11.833600
## Merc 450SE          2.797281 16.564900
## Merc 450SL          2.850707 13.912900
## Merc 450SLC         2.721295 14.288400
## Cadillac Fleetwood  2.341806 27.562500
## Lincoln Continental 2.341806 29.419776
## Chrysler Imperial   2.687847 28.569025
## Fiat 128            3.478158  4.840000
## Honda Civic         3.414443  2.608225
## Toyota Corolla      3.523415  3.367225
## Toyota Corona       3.068053  6.076225
## Dodge Challenger    2.740840 12.390400
## AMC Javelin         2.721295 11.799225
## Camaro Z28          2.587764 14.745600
## Pontiac Firebird    2.954910 14.784025
## Fiat X1-9           3.306887  3.744225
## Porsche 914-2       3.258097  4.579600
## Lotus Europa        3.414443  2.289169
## Ford Pantera L      2.760010 10.048900
## Ferrari Dino        2.980619  7.672900
## Maserati Bora       2.708050 12.744900
## Volvo 142E          3.063391  7.728400
# Equivalently
library(dplyr)
D = mutate(mtcars, 
           log.mpg = log(mpg),
           sq.wt = wt^2
          )
D
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
##                      log.mpg     sq.wt
## Mazda RX4           3.044522  6.864400
## Mazda RX4 Wag       3.044522  8.265625
## Datsun 710          3.126761  5.382400
## Hornet 4 Drive      3.063391 10.336225
## Hornet Sportabout   2.928524 11.833600
## Valiant             2.895912 11.971600
## Duster 360          2.660260 12.744900
## Merc 240D           3.194583 10.176100
## Merc 230            3.126761  9.922500
## Merc 280            2.954910 11.833600
## Merc 280C           2.879198 11.833600
## Merc 450SE          2.797281 16.564900
## Merc 450SL          2.850707 13.912900
## Merc 450SLC         2.721295 14.288400
## Cadillac Fleetwood  2.341806 27.562500
## Lincoln Continental 2.341806 29.419776
## Chrysler Imperial   2.687847 28.569025
## Fiat 128            3.478158  4.840000
## Honda Civic         3.414443  2.608225
## Toyota Corolla      3.523415  3.367225
## Toyota Corona       3.068053  6.076225
## Dodge Challenger    2.740840 12.390400
## AMC Javelin         2.721295 11.799225
## Camaro Z28          2.587764 14.745600
## Pontiac Firebird    2.954910 14.784025
## Fiat X1-9           3.306887  3.744225
## Porsche 914-2       3.258097  4.579600
## Lotus Europa        3.414443  2.289169
## Ford Pantera L      2.760010 10.048900
## Ferrari Dino        2.980619  7.672900
## Maserati Bora       2.708050 12.744900
## Volvo 142E          3.063391  7.728400
# Rename columns of a data frame
D = mtcars
D = rename(D,
       "mile per gallon" = mpg,
       "cylinder" = cyl,
       "horse power" = hp
      )
D
##                     mile per gallon cylinder  disp horse power drat    wt  qsec
## Mazda RX4                      21.0        6 160.0         110 3.90 2.620 16.46
## Mazda RX4 Wag                  21.0        6 160.0         110 3.90 2.875 17.02
## Datsun 710                     22.8        4 108.0          93 3.85 2.320 18.61
## Hornet 4 Drive                 21.4        6 258.0         110 3.08 3.215 19.44
## Hornet Sportabout              18.7        8 360.0         175 3.15 3.440 17.02
## Valiant                        18.1        6 225.0         105 2.76 3.460 20.22
## Duster 360                     14.3        8 360.0         245 3.21 3.570 15.84
## Merc 240D                      24.4        4 146.7          62 3.69 3.190 20.00
## Merc 230                       22.8        4 140.8          95 3.92 3.150 22.90
## Merc 280                       19.2        6 167.6         123 3.92 3.440 18.30
## Merc 280C                      17.8        6 167.6         123 3.92 3.440 18.90
## Merc 450SE                     16.4        8 275.8         180 3.07 4.070 17.40
## Merc 450SL                     17.3        8 275.8         180 3.07 3.730 17.60
## Merc 450SLC                    15.2        8 275.8         180 3.07 3.780 18.00
## Cadillac Fleetwood             10.4        8 472.0         205 2.93 5.250 17.98
## Lincoln Continental            10.4        8 460.0         215 3.00 5.424 17.82
## Chrysler Imperial              14.7        8 440.0         230 3.23 5.345 17.42
## Fiat 128                       32.4        4  78.7          66 4.08 2.200 19.47
## Honda Civic                    30.4        4  75.7          52 4.93 1.615 18.52
## Toyota Corolla                 33.9        4  71.1          65 4.22 1.835 19.90
## Toyota Corona                  21.5        4 120.1          97 3.70 2.465 20.01
## Dodge Challenger               15.5        8 318.0         150 2.76 3.520 16.87
## AMC Javelin                    15.2        8 304.0         150 3.15 3.435 17.30
## Camaro Z28                     13.3        8 350.0         245 3.73 3.840 15.41
## Pontiac Firebird               19.2        8 400.0         175 3.08 3.845 17.05
## Fiat X1-9                      27.3        4  79.0          66 4.08 1.935 18.90
## Porsche 914-2                  26.0        4 120.3          91 4.43 2.140 16.70
## Lotus Europa                   30.4        4  95.1         113 3.77 1.513 16.90
## Ford Pantera L                 15.8        8 351.0         264 4.22 3.170 14.50
## Ferrari Dino                   19.7        6 145.0         175 3.62 2.770 15.50
## Maserati Bora                  15.0        8 301.0         335 3.54 3.570 14.60
## Volvo 142E                     21.4        4 121.0         109 4.11 2.780 18.60
##                     vs am gear carb
## Mazda RX4            0  1    4    4
## Mazda RX4 Wag        0  1    4    4
## Datsun 710           1  1    4    1
## Hornet 4 Drive       1  0    3    1
## Hornet Sportabout    0  0    3    2
## Valiant              1  0    3    1
## Duster 360           0  0    3    4
## Merc 240D            1  0    4    2
## Merc 230             1  0    4    2
## Merc 280             1  0    4    4
## Merc 280C            1  0    4    4
## Merc 450SE           0  0    3    3
## Merc 450SL           0  0    3    3
## Merc 450SLC          0  0    3    3
## Cadillac Fleetwood   0  0    3    4
## Lincoln Continental  0  0    3    4
## Chrysler Imperial    0  0    3    4
## Fiat 128             1  1    4    1
## Honda Civic          1  1    4    2
## Toyota Corolla       1  1    4    1
## Toyota Corona        1  0    3    1
## Dodge Challenger     0  0    3    2
## AMC Javelin          0  0    3    2
## Camaro Z28           0  0    3    4
## Pontiac Firebird     0  0    3    2
## Fiat X1-9            1  1    4    1
## Porsche 914-2        0  1    5    2
## Lotus Europa         1  1    5    2
## Ford Pantera L       0  1    5    4
## Ferrari Dino         0  1    5    6
## Maserati Bora        0  1    5    8
## Volvo 142E           1  1    4    2
# Alternatively
D = mtcars
names(D)[c(1, 2, 4)] = c("mile per gallon", "cylinder", "horse power")
D
##                     mile per gallon cylinder  disp horse power drat    wt  qsec
## Mazda RX4                      21.0        6 160.0         110 3.90 2.620 16.46
## Mazda RX4 Wag                  21.0        6 160.0         110 3.90 2.875 17.02
## Datsun 710                     22.8        4 108.0          93 3.85 2.320 18.61
## Hornet 4 Drive                 21.4        6 258.0         110 3.08 3.215 19.44
## Hornet Sportabout              18.7        8 360.0         175 3.15 3.440 17.02
## Valiant                        18.1        6 225.0         105 2.76 3.460 20.22
## Duster 360                     14.3        8 360.0         245 3.21 3.570 15.84
## Merc 240D                      24.4        4 146.7          62 3.69 3.190 20.00
## Merc 230                       22.8        4 140.8          95 3.92 3.150 22.90
## Merc 280                       19.2        6 167.6         123 3.92 3.440 18.30
## Merc 280C                      17.8        6 167.6         123 3.92 3.440 18.90
## Merc 450SE                     16.4        8 275.8         180 3.07 4.070 17.40
## Merc 450SL                     17.3        8 275.8         180 3.07 3.730 17.60
## Merc 450SLC                    15.2        8 275.8         180 3.07 3.780 18.00
## Cadillac Fleetwood             10.4        8 472.0         205 2.93 5.250 17.98
## Lincoln Continental            10.4        8 460.0         215 3.00 5.424 17.82
## Chrysler Imperial              14.7        8 440.0         230 3.23 5.345 17.42
## Fiat 128                       32.4        4  78.7          66 4.08 2.200 19.47
## Honda Civic                    30.4        4  75.7          52 4.93 1.615 18.52
## Toyota Corolla                 33.9        4  71.1          65 4.22 1.835 19.90
## Toyota Corona                  21.5        4 120.1          97 3.70 2.465 20.01
## Dodge Challenger               15.5        8 318.0         150 2.76 3.520 16.87
## AMC Javelin                    15.2        8 304.0         150 3.15 3.435 17.30
## Camaro Z28                     13.3        8 350.0         245 3.73 3.840 15.41
## Pontiac Firebird               19.2        8 400.0         175 3.08 3.845 17.05
## Fiat X1-9                      27.3        4  79.0          66 4.08 1.935 18.90
## Porsche 914-2                  26.0        4 120.3          91 4.43 2.140 16.70
## Lotus Europa                   30.4        4  95.1         113 3.77 1.513 16.90
## Ford Pantera L                 15.8        8 351.0         264 4.22 3.170 14.50
## Ferrari Dino                   19.7        6 145.0         175 3.62 2.770 15.50
## Maserati Bora                  15.0        8 301.0         335 3.54 3.570 14.60
## Volvo 142E                     21.4        4 121.0         109 4.11 2.780 18.60
##                     vs am gear carb
## Mazda RX4            0  1    4    4
## Mazda RX4 Wag        0  1    4    4
## Datsun 710           1  1    4    1
## Hornet 4 Drive       1  0    3    1
## Hornet Sportabout    0  0    3    2
## Valiant              1  0    3    1
## Duster 360           0  0    3    4
## Merc 240D            1  0    4    2
## Merc 230             1  0    4    2
## Merc 280             1  0    4    4
## Merc 280C            1  0    4    4
## Merc 450SE           0  0    3    3
## Merc 450SL           0  0    3    3
## Merc 450SLC          0  0    3    3
## Cadillac Fleetwood   0  0    3    4
## Lincoln Continental  0  0    3    4
## Chrysler Imperial    0  0    3    4
## Fiat 128             1  1    4    1
## Honda Civic          1  1    4    2
## Toyota Corolla       1  1    4    1
## Toyota Corona        1  0    3    1
## Dodge Challenger     0  0    3    2
## AMC Javelin          0  0    3    2
## Camaro Z28           0  0    3    4
## Pontiac Firebird     0  0    3    2
## Fiat X1-9            1  1    4    1
## Porsche 914-2        0  1    5    2
## Lotus Europa         1  1    5    2
## Ford Pantera L       0  1    5    4
## Ferrari Dino         0  1    5    6
## Maserati Bora        0  1    5    8
## Volvo 142E           1  1    4    2
# Subsetting a data frame by selecting some columns
library(dplyr)
select(mtcars, disp, wt, hp)
##                      disp    wt  hp
## Mazda RX4           160.0 2.620 110
## Mazda RX4 Wag       160.0 2.875 110
## Datsun 710          108.0 2.320  93
## Hornet 4 Drive      258.0 3.215 110
## Hornet Sportabout   360.0 3.440 175
## Valiant             225.0 3.460 105
## Duster 360          360.0 3.570 245
## Merc 240D           146.7 3.190  62
## Merc 230            140.8 3.150  95
## Merc 280            167.6 3.440 123
## Merc 280C           167.6 3.440 123
## Merc 450SE          275.8 4.070 180
## Merc 450SL          275.8 3.730 180
## Merc 450SLC         275.8 3.780 180
## Cadillac Fleetwood  472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial   440.0 5.345 230
## Fiat 128             78.7 2.200  66
## Honda Civic          75.7 1.615  52
## Toyota Corolla       71.1 1.835  65
## Toyota Corona       120.1 2.465  97
## Dodge Challenger    318.0 3.520 150
## AMC Javelin         304.0 3.435 150
## Camaro Z28          350.0 3.840 245
## Pontiac Firebird    400.0 3.845 175
## Fiat X1-9            79.0 1.935  66
## Porsche 914-2       120.3 2.140  91
## Lotus Europa         95.1 1.513 113
## Ford Pantera L      351.0 3.170 264
## Ferrari Dino        145.0 2.770 175
## Maserati Bora       301.0 3.570 335
## Volvo 142E          121.0 2.780 109
# Alternatively
mtcars[ , c(3, 6, 4)]
##                      disp    wt  hp
## Mazda RX4           160.0 2.620 110
## Mazda RX4 Wag       160.0 2.875 110
## Datsun 710          108.0 2.320  93
## Hornet 4 Drive      258.0 3.215 110
## Hornet Sportabout   360.0 3.440 175
## Valiant             225.0 3.460 105
## Duster 360          360.0 3.570 245
## Merc 240D           146.7 3.190  62
## Merc 230            140.8 3.150  95
## Merc 280            167.6 3.440 123
## Merc 280C           167.6 3.440 123
## Merc 450SE          275.8 4.070 180
## Merc 450SL          275.8 3.730 180
## Merc 450SLC         275.8 3.780 180
## Cadillac Fleetwood  472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial   440.0 5.345 230
## Fiat 128             78.7 2.200  66
## Honda Civic          75.7 1.615  52
## Toyota Corolla       71.1 1.835  65
## Toyota Corona       120.1 2.465  97
## Dodge Challenger    318.0 3.520 150
## AMC Javelin         304.0 3.435 150
## Camaro Z28          350.0 3.840 245
## Pontiac Firebird    400.0 3.845 175
## Fiat X1-9            79.0 1.935  66
## Porsche 914-2       120.3 2.140  91
## Lotus Europa         95.1 1.513 113
## Ford Pantera L      351.0 3.170 264
## Ferrari Dino        145.0 2.770 175
## Maserati Bora       301.0 3.570 335
## Volvo 142E          121.0 2.780 109
# or 
mtcars[c(3, 6, 4)]
##                      disp    wt  hp
## Mazda RX4           160.0 2.620 110
## Mazda RX4 Wag       160.0 2.875 110
## Datsun 710          108.0 2.320  93
## Hornet 4 Drive      258.0 3.215 110
## Hornet Sportabout   360.0 3.440 175
## Valiant             225.0 3.460 105
## Duster 360          360.0 3.570 245
## Merc 240D           146.7 3.190  62
## Merc 230            140.8 3.150  95
## Merc 280            167.6 3.440 123
## Merc 280C           167.6 3.440 123
## Merc 450SE          275.8 4.070 180
## Merc 450SL          275.8 3.730 180
## Merc 450SLC         275.8 3.780 180
## Cadillac Fleetwood  472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial   440.0 5.345 230
## Fiat 128             78.7 2.200  66
## Honda Civic          75.7 1.615  52
## Toyota Corolla       71.1 1.835  65
## Toyota Corona       120.1 2.465  97
## Dodge Challenger    318.0 3.520 150
## AMC Javelin         304.0 3.435 150
## Camaro Z28          350.0 3.840 245
## Pontiac Firebird    400.0 3.845 175
## Fiat X1-9            79.0 1.935  66
## Porsche 914-2       120.3 2.140  91
## Lotus Europa         95.1 1.513 113
## Ford Pantera L      351.0 3.170 264
## Ferrari Dino        145.0 2.770 175
## Maserati Bora       301.0 3.570 335
## Volvo 142E          121.0 2.780 109
# or 
mtcars[, c("disp", "wt", "hp")]
##                      disp    wt  hp
## Mazda RX4           160.0 2.620 110
## Mazda RX4 Wag       160.0 2.875 110
## Datsun 710          108.0 2.320  93
## Hornet 4 Drive      258.0 3.215 110
## Hornet Sportabout   360.0 3.440 175
## Valiant             225.0 3.460 105
## Duster 360          360.0 3.570 245
## Merc 240D           146.7 3.190  62
## Merc 230            140.8 3.150  95
## Merc 280            167.6 3.440 123
## Merc 280C           167.6 3.440 123
## Merc 450SE          275.8 4.070 180
## Merc 450SL          275.8 3.730 180
## Merc 450SLC         275.8 3.780 180
## Cadillac Fleetwood  472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial   440.0 5.345 230
## Fiat 128             78.7 2.200  66
## Honda Civic          75.7 1.615  52
## Toyota Corolla       71.1 1.835  65
## Toyota Corona       120.1 2.465  97
## Dodge Challenger    318.0 3.520 150
## AMC Javelin         304.0 3.435 150
## Camaro Z28          350.0 3.840 245
## Pontiac Firebird    400.0 3.845 175
## Fiat X1-9            79.0 1.935  66
## Porsche 914-2       120.3 2.140  91
## Lotus Europa         95.1 1.513 113
## Ford Pantera L      351.0 3.170 264
## Ferrari Dino        145.0 2.770 175
## Maserati Bora       301.0 3.570 335
## Volvo 142E          121.0 2.780 109
# or
mtcars[c("disp", "wt", "hp")]
##                      disp    wt  hp
## Mazda RX4           160.0 2.620 110
## Mazda RX4 Wag       160.0 2.875 110
## Datsun 710          108.0 2.320  93
## Hornet 4 Drive      258.0 3.215 110
## Hornet Sportabout   360.0 3.440 175
## Valiant             225.0 3.460 105
## Duster 360          360.0 3.570 245
## Merc 240D           146.7 3.190  62
## Merc 230            140.8 3.150  95
## Merc 280            167.6 3.440 123
## Merc 280C           167.6 3.440 123
## Merc 450SE          275.8 4.070 180
## Merc 450SL          275.8 3.730 180
## Merc 450SLC         275.8 3.780 180
## Cadillac Fleetwood  472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial   440.0 5.345 230
## Fiat 128             78.7 2.200  66
## Honda Civic          75.7 1.615  52
## Toyota Corolla       71.1 1.835  65
## Toyota Corona       120.1 2.465  97
## Dodge Challenger    318.0 3.520 150
## AMC Javelin         304.0 3.435 150
## Camaro Z28          350.0 3.840 245
## Pontiac Firebird    400.0 3.845 175
## Fiat X1-9            79.0 1.935  66
## Porsche 914-2       120.3 2.140  91
## Lotus Europa         95.1 1.513 113
## Ford Pantera L      351.0 3.170 264
## Ferrari Dino        145.0 2.770 175
## Maserati Bora       301.0 3.570 335
## Volvo 142E          121.0 2.780 109
# We can also deselect some columns to form a subset
select(mtcars, -am, -carb)
##                      mpg cyl  disp  hp drat    wt  qsec vs gear
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1    4
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1    3
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0    3
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1    3
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0    3
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1    4
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1    4
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0    3
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0    3
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0    3
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1    4
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1    4
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1    4
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1    3
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0    3
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0    3
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0    3
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0    3
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1    4
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0    5
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1    5
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0    5
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0    5
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0    5
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1    4
# Subsetting a data frame by selecting some rows meeting some conditions
D = subset(mtcars, cyl == 6 & gear %in% c(3, 5) & hp > 100)
D
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## Ferrari Dino   19.7   6  145 175 3.62 2.770 15.50  0  1    5    6
# or
D = mtcars[mtcars$cyl == 6 & mtcars$gear %in% c(3, 5) & mtcars$hp > 100, ]
D
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## Ferrari Dino   19.7   6  145 175 3.62 2.770 15.50  0  1    5    6
str(iris) # The data frame "iris" comes from the "datasets" package, which R loads by default
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
plot(iris)

head(mpg) # Display the first 6 rows of data frame "mpg"
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
table(mpg$class) # Tabulate the class column in the data frame "mpg": a frequency table
## 
##    2seater    compact    midsize    minivan     pickup subcompact        suv 
##          5         47         41         11         33         35         62
# Create a random sample from a finite population with known elements
population = c(12, 56, 87, 43, 56, 54, 82, 34, 61, 52, 84, 97, 37, 28, 39)
y = sample(x = population, size = 5, replace = FALSE) # Sampling 5 values from the population without replacement
y
## [1] 54 39 37 84 43
# Create a sample from a discrete population with a known distribution. 
z = sample(1:3, size = 1000, prob = c(0.7, 0.15, 0.15), replace = TRUE) # The sampling will have to be w/ replacement
table(z)/1000*100 # Check the quality of the sample to see how close it is to the population 
## z
##    1    2    3 
## 70.8 15.9 13.3
# A question: How can we write a function that randomly chooses a given number of rows 
# from an existing data frame to create a new data frame? This is the basis of data partitioning.
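
One possible answer to the question above is sketched below. It is only an illustration, not the only way to do it: the helper name sample_rows and its arguments D and k are hypothetical, and set.seed() is used solely to make the illustration reproducible.

```r
# Hypothetical helper: draw k rows at random (without replacement) from data frame D
sample_rows = function(D, k){
  stopifnot(k >= 0, k <= nrow(D))   # cannot draw more rows than D has
  idx = sample(seq_len(nrow(D)), size = k, replace = FALSE) # random row indices
  D[idx, , drop = FALSE]            # keep the result a data frame
}

set.seed(1)                 # only to make this illustration reproducible
S = sample_rows(mtcars, 5)  # a new data frame with 5 randomly chosen rows
nrow(S)
```

The same idea (shuffle or sample the row indices, then index the data frame with them) is what a train/validation/test split builds on.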

# Group your data by one or more categorical variables and then summarize the grouped data
library(dplyr) # Load the package
mySummary = mtcars %>% group_by(vs, am) %>% summarise(n = n())
## `summarise()` has grouped output by 'vs'. You can override using the `.groups`
## argument.
mySummary
## # A tibble: 4 × 3
## # Groups:   vs [2]
##      vs    am     n
##   <dbl> <dbl> <int>
## 1     0     0    12
## 2     0     1     6
## 3     1     0     7
## 4     1     1     7
# Remove all the objects we created so far. This can be very useful!
rm(list = ls()) 

# Round values
x = c(100.45, 67.35, 78.82, 98.43, - 67.41, -84.92)
round(x, 1) # round to one decimal place
## [1] 100.4  67.3  78.8  98.4 -67.4 -84.9
round(x, 0) # round to the nearest whole number
## [1] 100  67  79  98 -67 -85
round(x, -1) # round to the nearest multiple of 10
## [1] 100  70  80 100 -70 -80
# Paste several strings together; sep = "" means nothing is inserted between them.
paste("Tomorrow is ", Sys.Date() + 1, ", the due date for ", "project #", 15, ". ", "Don't miss it!", sep = "")
## [1] "Tomorrow is 2026-01-15, the due date for project #15. Don't miss it!"
# The switch() function for conditional execution.
# Matching is case-sensitive: "Sd" matches none of the names below, so the unnamed default runs.
x = c(45, 78, 93, 25, 54, 80)
stats = "Sd"
switch(stats,
       Mean = mean(x),
       SD = sd(x),
       Median = median(x),
       Summary = summary(x),
       cat("Sorry, it goes beyond my capacity.")
)
## Sorry, it goes beyond my capacity.
# Handling dates
y = c(34, 56, 61, 78, 84, 92, 100, 120, 125)
x = c("1990-05-01", "1990-05-02", "1990-05-03", "1990-05-04", "1990-05-05", "1990-05-06", "1990-05-07", "1990-05-08", "1990-05-09")
dx = as.Date(x)
dx
## [1] "1990-05-01" "1990-05-02" "1990-05-03" "1990-05-04" "1990-05-05"
## [6] "1990-05-06" "1990-05-07" "1990-05-08" "1990-05-09"
class(x)
## [1] "character"
class(dx)
## [1] "Date"
plot(y~dx, xlab = "Date")

D = Sys.Date() # Extract the date of today
weekdays(D) # Extract the week day
## [1] "Wednesday"
months(D)
## [1] "January"
quarters(D)
## [1] "Q1"
julian(D) # Number of days since the origin (1970-01-01)
## [1] 20467
## attr(,"origin")
## [1] "1970-01-01"
julian(D, origin = as.Date("2000-07-01")) # Number of days since the origin (2000-07-01)
## [1] 9328
## attr(,"origin")
## [1] "2000-07-01"
# Find the indices of minimum and maximum of a vector
z = c(56, 76, 34, 12, 98, 45, 32, 77)
which.max(z)
## [1] 5
which.min(z)
## [1] 4

User-defined Functions in R

R users can also define their own functions. The structure of a user-defined function in R looks like the following:

  # functionName = function(a list of parameters/arguments separated by comma){
  #    The function body with the last line being the returned value (can be any data structure)
  # }

The different parts of a function are

  • Function Name: This is the actual name of the function. The function is stored in the R environment as an object with this name.

  • Arguments: An argument is a placeholder. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may have no arguments. Arguments can also have default values.

  • Function Body: The function body contains a collection of statements that define what the function does.

  • Return Value: The return value of a function is the last expression in the function body to be evaluated.

(From https://www.tutorialspoint.com/r/r_functions.htm)
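
As a minimal sketch of the parts listed above, the following defines one function with a default argument value and one with no arguments at all. The names powerSum and greet are hypothetical and chosen only for illustration.

```r
# Hypothetical function: sum of the p-th powers of x, with p defaulting to 2
powerSum = function(x, p = 2){
  sum(x^p)  # the last evaluated expression is the return value
}

# A function with no arguments
greet = function(){
  "Hello from R!"
}

a = powerSum(1:3)         # uses the default p = 2: 1 + 4 + 9
b = powerSum(1:3, p = 3)  # overrides the default: 1 + 8 + 27
g = greet()
```

Both functions are stored in the environment as ordinary objects, so typing powerSum without parentheses prints its definition.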

f1 = function(x){
  2*x^2 - 3/x +1
}

# A better version that handles the problematic input x = 0 (x must be a single number, not a vector)
f2 = function(x){
  if (x != 0) {
    2*x^2 - 3/x +1
  } else {
    cat("Can't be done due to zero denominator!\n", "Please use a non-zero input.")
  }
}

# A function for a simple summary of a numeric sample
mySummary = function(x){
  Mean = mean(x)
  Median = median(x)
  Std = sd(x)
  
  list(Mean = Mean, Median = Median, "Standard Deviation" = Std)
  
}

# A function that returns the elements of a vector in reverse order
rprint = function(x){
  n = length(x)
  v = NULL
  for (i in seq_len(n)){ # seq_len(n) is empty when n = 0, whereas 1:n would give c(1, 0)
    v[i] = x[n-i+1]
  }
  v
}

rprint(1:9)
## [1] 9 8 7 6 5 4 3 2 1
rprint(letters)
##  [1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h"
## [20] "g" "f" "e" "d" "c" "b" "a"
# There is a built-in function in R that gives the reversal
rev(1:9)
## [1] 9 8 7 6 5 4 3 2 1
rev(letters)
##  [1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h"
## [20] "g" "f" "e" "d" "c" "b" "a"
# A user-defined function that partitions a data frame into training data, validation data, and test data
partition = function(D, prop = c(0.6, 0.2, 0.2)){ # D is a data frame to partition
  n = nrow(D)
  idx = sample(x = 1:n, size = n, replace = FALSE) # Shuffle the original rows of D
  shuffled = D[idx, ]
  n1 = round(n * prop[1])
  n2 = round(n * prop[2])
  n3 = n - n1 - n2 # Can I do n3 = round(n * prop[3])?
  training = shuffled[1:n1, ]
  validation = shuffled[(n1+1):(n1+n2), ]
  test = shuffled[(n1 + n2 + 1):n, ]
  
  L = list(training = training, validation = validation, test = test)
  return(L)
}

partition(mtcars, c(0.7, 0.15, 0.15))
## $training
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 
## $validation
##               mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Merc 240D    24.4   4 146.7  62 3.69 3.190 20.0  1  0    4    2
## Merc 450SE   16.4   8 275.8 180 3.07 4.070 17.4  0  0    3    3
## Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Merc 450SLC  15.2   8 275.8 180 3.07 3.780 18.0  0  0    3    3
## 
## $test
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
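
The comment inside partition() asks whether n3 could be computed as round(n * prop[3]). The short check below, a sketch that uses the 32 rows of mtcars and the function's default proportions, compares the two ways of computing the three group sizes. The variable names sizes_rounded and sizes_remainder are introduced here only for illustration.

```r
n = nrow(mtcars)         # 32 rows
prop = c(0.6, 0.2, 0.2)  # the default proportions in partition()

# Option 1: round all three sizes independently
sizes_rounded = round(n * prop)
sizes_rounded
sum(sizes_rounded)       # may differ from n

# Option 2: round the first two and give the remainder to the third
n1 = round(n * prop[1])
n2 = round(n * prop[2])
sizes_remainder = c(n1, n2, n - n1 - n2)
sizes_remainder
sum(sizes_remainder)     # always equals n
```

With these inputs the independently rounded sizes are 19, 6, and 6, which add up to 31, so one row would belong to no group. Taking the remainder for the last group guarantees that every row is used exactly once.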

The Pipe Operator “%>%” in R

The pipe operator, %>%, comes from the “magrittr” package. The point of the pipe is to help you write human-friendly code.

In mathematics, you can make a composite function by doing something like y = f(g(h(x))), which is equivalent to

\[x \rightarrow h() \rightarrow g() \rightarrow f() = y\]

The above process involves 4 steps:

Step 1: Input “x” into the function “h”.

Step 2: Input the result “h(x)” from step 1 into the function “g”.

Step 3: Input the result “g(h(x))” from step 2 into the function “f”.

Step 4: The output is “f(g(h(x)))” and assigned to “y”.

In the “magrittr” package, the right arrow “$\rightarrow$” is represented by “%>%”. Here are some examples.
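
As a minimal sketch of the four steps above, the following compares the nested call with the piped version. The functions h, g, and f are hypothetical stand-ins chosen only so that the result is easy to check by hand.

```r
library(magrittr)  # provides %>%; dplyr also re-exports it

# Hypothetical stand-ins for the functions h, g, and f in the text
h = function(v) sqrt(v)  # Step 1: applied to x
g = function(v) sum(v)   # Step 2: applied to h(x)
f = function(v) 2 * v    # Step 3: applied to g(h(x))

x = c(1, 4, 9, 16)

y_nested = f(g(h(x)))                 # read from the inside out
y_piped  = x %>% h() %>% g() %>% f()  # read from left to right

y_nested
y_piped
```

Both lines compute the same value (the square roots 1, 2, 3, 4 sum to 10, which is then doubled); the piped form simply lists the steps in the order in which they happen.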

x = c( 23, 45, 34, 78, 12, 56)
mean(x)
## [1] 41.33333
# The following 3 lines of code are each equivalent to the previous line
x %>% mean() # that is, we can factor x out!
## [1] 41.33333
x %>% mean
## [1] 41.33333
x %>% mean(.) # "." is a placeholder
## [1] 41.33333
x %>% sqrt() %>% sum() # Pipes can be chained: just for fun
## [1] 37.11416
# The following gives a more realistic example.
library(dplyr) # The count() function is from this package.
print(starwars) # A dataset from the package
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
# The column "species" is a categorical variable in the data "starwars".
# The following gets its distribution, which is discrete.
D1 <- starwars %>% count(species) 
D1
## # A tibble: 38 × 2
##    species       n
##    <chr>     <int>
##  1 Aleena        1
##  2 Besalisk      1
##  3 Cerean        1
##  4 Chagrian      1
##  5 Clawdite      1
##  6 Droid         6
##  7 Dug           1
##  8 Ewok          1
##  9 Geonosian     1
## 10 Gungan        3
## # ℹ 28 more rows
# Plot the discrete distribution
barplot(height = D1$n, names = D1$species)

with(D1, barplot(height = n, names = species)) # Alternatively

# Sort the distribution table by the frequency (column "n")
D2 <- starwars %>% count(species, sort = TRUE)
D2
## # A tibble: 38 × 2
##    species      n
##    <chr>    <int>
##  1 Human       35
##  2 Droid        6
##  3 <NA>         4
##  4 Gungan       3
##  5 Kaminoan     2
##  6 Mirialan     2
##  7 Twi'lek      2
##  8 Wookiee      2
##  9 Zabrak       2
## 10 Aleena       1
## # ℹ 28 more rows
# Plot the discrete distribution
bp = barplot(height = D2$n, names = D2$species, ylim = c(0, max(D2$n)*1.1), las = 2) # A sorted barchart, called the Pareto chart

with(D2, barplot(height = n, names = species)) # Alternatively

# The following adds the count labels above the bars (pos = 3), offset by 0.1 of a character width
text(bp, D2$n*0.9, labels = D2$n, pos = 3, offset = 0.1, col = "red")
title("Distribution of Species in Starwars", col.main = "blue", cex.main = 2, sub = "(data courtesy of xyz)")

# Another way to plot: just for illustration and not recommended
starwars %>% .$species %>% table() %>% sort(., decreasing = TRUE) %>% barplot(ylim = c(0, max(D2$n)*1.1), las = 2) %>% text(., D2$n, labels = D2$n, pos = 3, offset = 0.1, col = "red")

# Joint distribution of "sex" and "gender" and sort by the frequency (column "n")
D3 <- starwars %>% count(sex, gender, sort = TRUE)
D3
## # A tibble: 6 × 3
##   sex            gender        n
##   <chr>          <chr>     <int>
## 1 male           masculine    60
## 2 female         feminine     16
## 3 none           masculine     5
## 4 <NA>           <NA>          4
## 5 hermaphroditic masculine     1
## 6 none           feminine      1

Missing Values

Your data may have missing values. In R, missing values are indicated by “NA”. Depending on the context, missing values can be removed or imputed (replaced with plausible values). There is a large body of research on handling missing values.
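
As a small illustration of the imputation option, the sketch below fills each missing value with the mean of the observed values. Mean imputation is used here only because it is the simplest choice; the variable name x_imputed is hypothetical.

```r
x = c(2, 6, 9, NA, 10, 23, NA, 30)

is.na(x)       # logical vector flagging the missing positions
sum(is.na(x))  # number of missing values

# Mean imputation: fill each NA with the mean of the observed values
x_imputed = x
x_imputed[is.na(x_imputed)] = mean(x, na.rm = TRUE)
x_imputed
```

Mean imputation leaves the sample mean unchanged but makes the data look less variable than they really are, so it should be treated as a quick fix rather than a general recommendation.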

x = c(2, 6, 9, NA, 10, 23, NA, 30)

print(x)
## [1]  2  6  9 NA 10 23 NA 30
mean(x) # Produces NA
## [1] NA
sd(x) # Produces NA
## [1] NA
summary(x) # NA's handled
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00    6.75    9.50   13.33   19.75   30.00       2
# Handling missing values by simply removing them with the "na.rm" option.
mean(x, na.rm = TRUE)
## [1] 13.33333
sd(x, na.rm = TRUE)
## [1] 10.80123
# Remove the missing values to create a new vector
y = as.numeric(na.omit(x))
print(y)
## [1]  2  6  9 10 23 30
mean(y)
## [1] 13.33333
D1 = data.frame(x = c(1:10, NA, 12:15), y = c(2, 4, 5, 1, 0, NA, 7.2, NA, 10, 13.4, 15.2, NA, 18.5, 11, 20.5))
D2 = na.omit(D1) # Remove rows with a missing value

D1
##     x    y
## 1   1  2.0
## 2   2  4.0
## 3   3  5.0
## 4   4  1.0
## 5   5  0.0
## 6   6   NA
## 7   7  7.2
## 8   8   NA
## 9   9 10.0
## 10 10 13.4
## 11 NA 15.2
## 12 12   NA
## 13 13 18.5
## 14 14 11.0
## 15 15 20.5
D2
##     x    y
## 1   1  2.0
## 2   2  4.0
## 3   3  5.0
## 4   4  1.0
## 5   5  0.0
## 7   7  7.2
## 9   9 10.0
## 10 10 13.4
## 13 13 18.5
## 14 14 11.0
## 15 15 20.5

Normalizing Data

When fitting machine learning models (a neural network for regression, for example), we often need to normalize the data so that the transformed values are either z-scores or scores in the range from 0 to 1. We can use the scale() function to create z-scores and the preProcess() function from the caret package to transform data into the 0-1 range. The preProcess() function only estimates the transformation from the data (a “recipe”), so it must be used in conjunction with the predict() function, which applies that transformation.
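
To make the two transformations concrete, here is a base-R sketch of what each one computes. It does not use the caret package; the helper min_max is hypothetical and written by hand only to show the arithmetic behind scaling to the 0-1 range.

```r
D = iris[, 1:4]  # the four numeric columns of iris

# z-scores: subtract each column's mean and divide by its standard deviation
Z = scale(D)
round(colMeans(Z), 10)  # column means are (numerically) 0
apply(Z, 2, sd)         # column standard deviations are 1

# Min-max scaling to [0, 1], written by hand (min_max is a hypothetical helper)
min_max = function(v) (v - min(v)) / (max(v) - min(v))
M = as.data.frame(lapply(D, min_max))
sapply(M, range)        # each column now runs from 0 to 1
```

In a real modeling workflow the minimum, maximum, mean, and standard deviation should be estimated on the training data only and then reused on new data, which is why the estimate-then-apply pattern described above is useful.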

D = iris

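# cbind() appends iris[, 4], the raw (unscaled) Petal.Width, as an unnamed fifth column,
# so the last two columns show the scaled and the original values side by side
D1 = cbind(scale(D[, 1:4]), iris[, 4])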
D1
##        Sepal.Length Sepal.Width Petal.Length   Petal.Width    
##   [1,]  -0.89767388  1.01560199  -1.33575163 -1.3110521482 0.2
##   [2,]  -1.13920048 -0.13153881  -1.33575163 -1.3110521482 0.2
##   [3,]  -1.38072709  0.32731751  -1.39239929 -1.3110521482 0.2
##   [4,]  -1.50149039  0.09788935  -1.27910398 -1.3110521482 0.2
##   [5,]  -1.01843718  1.24503015  -1.33575163 -1.3110521482 0.2
##   [6,]  -0.53538397  1.93331463  -1.16580868 -1.0486667950 0.4
##   [7,]  -1.50149039  0.78617383  -1.33575163 -1.1798594716 0.3
##   [8,]  -1.01843718  0.78617383  -1.27910398 -1.3110521482 0.2
##   [9,]  -1.74301699 -0.36096697  -1.33575163 -1.3110521482 0.2
##  [10,]  -1.13920048  0.09788935  -1.27910398 -1.4422448248 0.1
##  [11,]  -0.53538397  1.47445831  -1.27910398 -1.3110521482 0.2
##  [12,]  -1.25996379  0.78617383  -1.22245633 -1.3110521482 0.2
##  [13,]  -1.25996379 -0.13153881  -1.33575163 -1.4422448248 0.1
##  [14,]  -1.86378030 -0.13153881  -1.50569459 -1.4422448248 0.1
##  [15,]  -0.05233076  2.16274279  -1.44904694 -1.3110521482 0.2
##  [16,]  -0.17309407  3.08045544  -1.27910398 -1.0486667950 0.4
##  [17,]  -0.53538397  1.93331463  -1.39239929 -1.0486667950 0.4
##  [18,]  -0.89767388  1.01560199  -1.33575163 -1.1798594716 0.3
##  [19,]  -0.17309407  1.70388647  -1.16580868 -1.1798594716 0.3
##  [20,]  -0.89767388  1.70388647  -1.27910398 -1.1798594716 0.3
##  [21,]  -0.53538397  0.78617383  -1.16580868 -1.3110521482 0.2
##  [22,]  -0.89767388  1.47445831  -1.27910398 -1.0486667950 0.4
##  [23,]  -1.50149039  1.24503015  -1.56234224 -1.3110521482 0.2
##  [24,]  -0.89767388  0.55674567  -1.16580868 -0.9174741184 0.5
##  [25,]  -1.25996379  0.78617383  -1.05251337 -1.3110521482 0.2
##  [26,]  -1.01843718 -0.13153881  -1.22245633 -1.3110521482 0.2
##  [27,]  -1.01843718  0.78617383  -1.22245633 -1.0486667950 0.4
##  [28,]  -0.77691058  1.01560199  -1.27910398 -1.3110521482 0.2
##  [29,]  -0.77691058  0.78617383  -1.33575163 -1.3110521482 0.2
##  [30,]  -1.38072709  0.32731751  -1.22245633 -1.3110521482 0.2
##  [31,]  -1.25996379  0.09788935  -1.22245633 -1.3110521482 0.2
##  [32,]  -0.53538397  0.78617383  -1.27910398 -1.0486667950 0.4
##  [33,]  -0.77691058  2.39217095  -1.27910398 -1.4422448248 0.1
##  [34,]  -0.41462067  2.62159911  -1.33575163 -1.3110521482 0.2
##  [35,]  -1.13920048  0.09788935  -1.27910398 -1.3110521482 0.2
##  [36,]  -1.01843718  0.32731751  -1.44904694 -1.3110521482 0.2
##  [37,]  -0.41462067  1.01560199  -1.39239929 -1.3110521482 0.2
##  [38,]  -1.13920048  1.24503015  -1.33575163 -1.4422448248 0.1
##  [39,]  -1.74301699 -0.13153881  -1.39239929 -1.3110521482 0.2
##  [40,]  -0.89767388  0.78617383  -1.27910398 -1.3110521482 0.2
##  [41,]  -1.01843718  1.01560199  -1.39239929 -1.1798594716 0.3
##  [42,]  -1.62225369 -1.73753594  -1.39239929 -1.1798594716 0.3
##  [43,]  -1.74301699  0.32731751  -1.39239929 -1.3110521482 0.2
##  [44,]  -1.01843718  1.01560199  -1.22245633 -0.7862814418 0.6
##  [45,]  -0.89767388  1.70388647  -1.05251337 -1.0486667950 0.4
##  [46,]  -1.25996379 -0.13153881  -1.33575163 -1.1798594716 0.3
##  [47,]  -0.89767388  1.70388647  -1.22245633 -1.3110521482 0.2
##  [48,]  -1.50149039  0.32731751  -1.33575163 -1.3110521482 0.2
##  [49,]  -0.65614727  1.47445831  -1.27910398 -1.3110521482 0.2
##  [50,]  -1.01843718  0.55674567  -1.33575163 -1.3110521482 0.2
##  [51,]   1.39682886  0.32731751   0.53362088  0.2632599711 1.4
##  [52,]   0.67224905  0.32731751   0.42032558  0.3944526477 1.5
##  [53,]   1.27606556  0.09788935   0.64691619  0.3944526477 1.5
##  [54,]  -0.41462067 -1.73753594   0.13708732  0.1320672944 1.3
##  [55,]   0.79301235 -0.59039513   0.47697323  0.3944526477 1.5
##  [56,]  -0.17309407 -0.59039513   0.42032558  0.1320672944 1.3
##  [57,]   0.55148575  0.55674567   0.53362088  0.5256453243 1.6
##  [58,]  -1.13920048 -1.50810778  -0.25944625 -0.2615107354 1.0
##  [59,]   0.91377565 -0.36096697   0.47697323  0.1320672944 1.3
##  [60,]  -0.77691058 -0.81982329   0.08043967  0.2632599711 1.4
##  [61,]  -1.01843718 -2.42582042  -0.14615094 -0.2615107354 1.0
##  [62,]   0.06843254 -0.13153881   0.25038262  0.3944526477 1.5
##  [63,]   0.18919584 -1.96696410   0.13708732 -0.2615107354 1.0
##  [64,]   0.30995914 -0.36096697   0.53362088  0.2632599711 1.4
##  [65,]  -0.29385737 -0.36096697  -0.08950329  0.1320672944 1.3
##  [66,]   1.03453895  0.09788935   0.36367793  0.2632599711 1.4
##  [67,]  -0.29385737 -0.13153881   0.42032558  0.3944526477 1.5
##  [68,]  -0.05233076 -0.81982329   0.19373497 -0.2615107354 1.0
##  [69,]   0.43072244 -1.96696410   0.42032558  0.3944526477 1.5
##  [70,]  -0.29385737 -1.27867961   0.08043967 -0.1303180588 1.1
##  [71,]   0.06843254  0.32731751   0.59026853  0.7880306775 1.8
##  [72,]   0.30995914 -0.59039513   0.13708732  0.1320672944 1.3
##  [73,]   0.55148575 -1.27867961   0.64691619  0.3944526477 1.5
##  [74,]   0.30995914 -0.59039513   0.53362088  0.0008746178 1.2
##  [75,]   0.67224905 -0.36096697   0.30703027  0.1320672944 1.3
##  [76,]   0.91377565 -0.13153881   0.36367793  0.2632599711 1.4
##  [77,]   1.15530226 -0.59039513   0.59026853  0.2632599711 1.4
##  [78,]   1.03453895 -0.13153881   0.70356384  0.6568380009 1.7
##  [79,]   0.18919584 -0.36096697   0.42032558  0.3944526477 1.5
##  [80,]  -0.17309407 -1.04925145  -0.14615094 -0.2615107354 1.0
##  [81,]  -0.41462067 -1.50810778   0.02379201 -0.1303180588 1.1
##  [82,]  -0.41462067 -1.50810778  -0.03285564 -0.2615107354 1.0
##  [83,]  -0.05233076 -0.81982329   0.08043967  0.0008746178 1.2
##  [84,]   0.18919584 -0.81982329   0.76021149  0.5256453243 1.6
##  [85,]  -0.53538397 -0.13153881   0.42032558  0.3944526477 1.5
##  [86,]   0.18919584  0.78617383   0.42032558  0.5256453243 1.6
##  [87,]   1.03453895  0.09788935   0.53362088  0.3944526477 1.5
##  [88,]   0.55148575 -1.73753594   0.36367793  0.1320672944 1.3
##  [89,]  -0.29385737 -0.13153881   0.19373497  0.1320672944 1.3
##  [90,]  -0.41462067 -1.27867961   0.13708732  0.1320672944 1.3
##  [91,]  -0.41462067 -1.04925145   0.36367793  0.0008746178 1.2
##  [92,]   0.30995914 -0.13153881   0.47697323  0.2632599711 1.4
##  [93,]  -0.05233076 -1.04925145   0.13708732  0.0008746178 1.2
##  [94,]  -1.01843718 -1.73753594  -0.25944625 -0.2615107354 1.0
##  [95,]  -0.29385737 -0.81982329   0.25038262  0.1320672944 1.3
##  [96,]  -0.17309407 -0.13153881   0.25038262  0.0008746178 1.2
##  [97,]  -0.17309407 -0.36096697   0.25038262  0.1320672944 1.3
##  [98,]   0.43072244 -0.36096697   0.30703027  0.1320672944 1.3
##  [99,]  -0.89767388 -1.27867961  -0.42938920 -0.1303180588 1.1
## [100,]  -0.17309407 -0.59039513   0.19373497  0.1320672944 1.3
## [101,]   0.55148575  0.55674567   1.27004036  1.7063794137 2.5
## [102,]  -0.05233076 -0.81982329   0.76021149  0.9192233541 1.9
## [103,]   1.51759216 -0.13153881   1.21339271  1.1816087073 2.1
## [104,]   0.55148575 -0.36096697   1.04344975  0.7880306775 1.8
## [105,]   0.79301235 -0.13153881   1.15674505  1.3128013839 2.2
## [106,]   2.12140867 -0.13153881   1.60992627  1.1816087073 2.1
## [107,]  -1.13920048 -1.27867961   0.42032558  0.6568380009 1.7
## [108,]   1.75911877 -0.36096697   1.43998331  0.7880306775 1.8
## [109,]   1.03453895 -1.27867961   1.15674505  0.7880306775 1.8
## [110,]   1.63835547  1.24503015   1.32668801  1.7063794137 2.5
## [111,]   0.79301235  0.32731751   0.76021149  1.0504160307 2.0
## [112,]   0.67224905 -0.81982329   0.87350679  0.9192233541 1.9
## [113,]   1.15530226 -0.13153881   0.98680210  1.1816087073 2.1
## [114,]  -0.17309407 -1.27867961   0.70356384  1.0504160307 2.0
## [115,]  -0.05233076 -0.59039513   0.76021149  1.5751867371 2.4
## [116,]   0.67224905  0.32731751   0.87350679  1.4439940605 2.3
## [117,]   0.79301235 -0.13153881   0.98680210  0.7880306775 1.8
## [118,]   2.24217198  1.70388647   1.66657392  1.3128013839 2.2
## [119,]   2.24217198 -1.04925145   1.77986923  1.4439940605 2.3
## [120,]   0.18919584 -1.96696410   0.70356384  0.3944526477 1.5
## [121,]   1.27606556  0.32731751   1.10009740  1.4439940605 2.3
## [122,]  -0.29385737 -0.59039513   0.64691619  1.0504160307 2.0
## [123,]   2.24217198 -0.59039513   1.66657392  1.0504160307 2.0
## [124,]   0.55148575 -0.81982329   0.64691619  0.7880306775 1.8
## [125,]   1.03453895  0.55674567   1.10009740  1.1816087073 2.1
## [126,]   1.63835547  0.32731751   1.27004036  0.7880306775 1.8
## [127,]   0.43072244 -0.59039513   0.59026853  0.7880306775 1.8
## [128,]   0.30995914 -0.13153881   0.64691619  0.7880306775 1.8
## [129,]   0.67224905 -0.59039513   1.04344975  1.1816087073 2.1
## [130,]   1.63835547 -0.13153881   1.15674505  0.5256453243 1.6
## [131,]   1.87988207 -0.59039513   1.32668801  0.9192233541 1.9
## [132,]   2.48369858  1.70388647   1.49663097  1.0504160307 2.0
## [133,]   0.67224905 -0.59039513   1.04344975  1.3128013839 2.2
## [134,]   0.55148575 -0.59039513   0.76021149  0.3944526477 1.5
## [135,]   0.30995914 -1.04925145   1.04344975  0.2632599711 1.4
## [136,]   2.24217198 -0.13153881   1.32668801  1.4439940605 2.3
## [137,]   0.55148575  0.78617383   1.04344975  1.5751867371 2.4
## [138,]   0.67224905  0.09788935   0.98680210  0.7880306775 1.8
## [139,]   0.18919584 -0.13153881   0.59026853  0.7880306775 1.8
## [140,]   1.27606556  0.09788935   0.93015445  1.1816087073 2.1
## [141,]   1.03453895  0.09788935   1.04344975  1.5751867371 2.4
## [142,]   1.27606556  0.09788935   0.76021149  1.4439940605 2.3
## [143,]  -0.05233076 -0.81982329   0.76021149  0.9192233541 1.9
## [144,]   1.15530226  0.32731751   1.21339271  1.4439940605 2.3
## [145,]   1.03453895  0.55674567   1.10009740  1.7063794137 2.5
## [146,]   1.03453895 -0.13153881   0.81685914  1.4439940605 2.3
## [147,]   0.55148575 -1.27867961   0.70356384  0.9192233541 1.9
## [148,]   0.79301235 -0.13153881   0.81685914  1.0504160307 2.0
## [149,]   0.43072244  0.78617383   0.93015445  1.4439940605 2.3
## [150,]   0.06843254 -0.13153881   0.76021149  0.7880306775 1.8
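# With no method given, preProcess() centers and scales the numeric columns (z-scores);
# the Species factor is left unchanged
D2 = caret::preProcess(iris) %>% predict(D)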

D2
##     Sepal.Length Sepal.Width Petal.Length   Petal.Width    Species
## 1    -0.89767388  1.01560199  -1.33575163 -1.3110521482     setosa
## 2    -1.13920048 -0.13153881  -1.33575163 -1.3110521482     setosa
## 3    -1.38072709  0.32731751  -1.39239929 -1.3110521482     setosa
## 4    -1.50149039  0.09788935  -1.27910398 -1.3110521482     setosa
## 5    -1.01843718  1.24503015  -1.33575163 -1.3110521482     setosa
## 6    -0.53538397  1.93331463  -1.16580868 -1.0486667950     setosa
## 7    -1.50149039  0.78617383  -1.33575163 -1.1798594716     setosa
## 8    -1.01843718  0.78617383  -1.27910398 -1.3110521482     setosa
## 9    -1.74301699 -0.36096697  -1.33575163 -1.3110521482     setosa
## 10   -1.13920048  0.09788935  -1.27910398 -1.4422448248     setosa
## 11   -0.53538397  1.47445831  -1.27910398 -1.3110521482     setosa
## 12   -1.25996379  0.78617383  -1.22245633 -1.3110521482     setosa
## 13   -1.25996379 -0.13153881  -1.33575163 -1.4422448248     setosa
## 14   -1.86378030 -0.13153881  -1.50569459 -1.4422448248     setosa
## 15   -0.05233076  2.16274279  -1.44904694 -1.3110521482     setosa
## 16   -0.17309407  3.08045544  -1.27910398 -1.0486667950     setosa
## 17   -0.53538397  1.93331463  -1.39239929 -1.0486667950     setosa
## 18   -0.89767388  1.01560199  -1.33575163 -1.1798594716     setosa
## 19   -0.17309407  1.70388647  -1.16580868 -1.1798594716     setosa
## 20   -0.89767388  1.70388647  -1.27910398 -1.1798594716     setosa
## 21   -0.53538397  0.78617383  -1.16580868 -1.3110521482     setosa
## 22   -0.89767388  1.47445831  -1.27910398 -1.0486667950     setosa
## 23   -1.50149039  1.24503015  -1.56234224 -1.3110521482     setosa
## 24   -0.89767388  0.55674567  -1.16580868 -0.9174741184     setosa
## 25   -1.25996379  0.78617383  -1.05251337 -1.3110521482     setosa
## 26   -1.01843718 -0.13153881  -1.22245633 -1.3110521482     setosa
## 27   -1.01843718  0.78617383  -1.22245633 -1.0486667950     setosa
## 28   -0.77691058  1.01560199  -1.27910398 -1.3110521482     setosa
## 29   -0.77691058  0.78617383  -1.33575163 -1.3110521482     setosa
## 30   -1.38072709  0.32731751  -1.22245633 -1.3110521482     setosa
## 31   -1.25996379  0.09788935  -1.22245633 -1.3110521482     setosa
## 32   -0.53538397  0.78617383  -1.27910398 -1.0486667950     setosa
## 33   -0.77691058  2.39217095  -1.27910398 -1.4422448248     setosa
## 34   -0.41462067  2.62159911  -1.33575163 -1.3110521482     setosa
## 35   -1.13920048  0.09788935  -1.27910398 -1.3110521482     setosa
## 36   -1.01843718  0.32731751  -1.44904694 -1.3110521482     setosa
## 37   -0.41462067  1.01560199  -1.39239929 -1.3110521482     setosa
## 38   -1.13920048  1.24503015  -1.33575163 -1.4422448248     setosa
## 39   -1.74301699 -0.13153881  -1.39239929 -1.3110521482     setosa
## 40   -0.89767388  0.78617383  -1.27910398 -1.3110521482     setosa
## 41   -1.01843718  1.01560199  -1.39239929 -1.1798594716     setosa
## 42   -1.62225369 -1.73753594  -1.39239929 -1.1798594716     setosa
## 43   -1.74301699  0.32731751  -1.39239929 -1.3110521482     setosa
## 44   -1.01843718  1.01560199  -1.22245633 -0.7862814418     setosa
## 45   -0.89767388  1.70388647  -1.05251337 -1.0486667950     setosa
## 46   -1.25996379 -0.13153881  -1.33575163 -1.1798594716     setosa
## 47   -0.89767388  1.70388647  -1.22245633 -1.3110521482     setosa
## 48   -1.50149039  0.32731751  -1.33575163 -1.3110521482     setosa
## 49   -0.65614727  1.47445831  -1.27910398 -1.3110521482     setosa
## 50   -1.01843718  0.55674567  -1.33575163 -1.3110521482     setosa
## 51    1.39682886  0.32731751   0.53362088  0.2632599711 versicolor
## 52    0.67224905  0.32731751   0.42032558  0.3944526477 versicolor
## 53    1.27606556  0.09788935   0.64691619  0.3944526477 versicolor
## 54   -0.41462067 -1.73753594   0.13708732  0.1320672944 versicolor
## 55    0.79301235 -0.59039513   0.47697323  0.3944526477 versicolor
## 56   -0.17309407 -0.59039513   0.42032558  0.1320672944 versicolor
## 57    0.55148575  0.55674567   0.53362088  0.5256453243 versicolor
## 58   -1.13920048 -1.50810778  -0.25944625 -0.2615107354 versicolor
## 59    0.91377565 -0.36096697   0.47697323  0.1320672944 versicolor
## 60   -0.77691058 -0.81982329   0.08043967  0.2632599711 versicolor
## 61   -1.01843718 -2.42582042  -0.14615094 -0.2615107354 versicolor
## 62    0.06843254 -0.13153881   0.25038262  0.3944526477 versicolor
## 63    0.18919584 -1.96696410   0.13708732 -0.2615107354 versicolor
## 64    0.30995914 -0.36096697   0.53362088  0.2632599711 versicolor
## 65   -0.29385737 -0.36096697  -0.08950329  0.1320672944 versicolor
## 66    1.03453895  0.09788935   0.36367793  0.2632599711 versicolor
## 67   -0.29385737 -0.13153881   0.42032558  0.3944526477 versicolor
## 68   -0.05233076 -0.81982329   0.19373497 -0.2615107354 versicolor
## 69    0.43072244 -1.96696410   0.42032558  0.3944526477 versicolor
## 70   -0.29385737 -1.27867961   0.08043967 -0.1303180588 versicolor
## 71    0.06843254  0.32731751   0.59026853  0.7880306775 versicolor
## 72    0.30995914 -0.59039513   0.13708732  0.1320672944 versicolor
## 73    0.55148575 -1.27867961   0.64691619  0.3944526477 versicolor
## 74    0.30995914 -0.59039513   0.53362088  0.0008746178 versicolor
## 75    0.67224905 -0.36096697   0.30703027  0.1320672944 versicolor
## 76    0.91377565 -0.13153881   0.36367793  0.2632599711 versicolor
## 77    1.15530226 -0.59039513   0.59026853  0.2632599711 versicolor
## 78    1.03453895 -0.13153881   0.70356384  0.6568380009 versicolor
## 79    0.18919584 -0.36096697   0.42032558  0.3944526477 versicolor
## 80   -0.17309407 -1.04925145  -0.14615094 -0.2615107354 versicolor
## 81   -0.41462067 -1.50810778   0.02379201 -0.1303180588 versicolor
## 82   -0.41462067 -1.50810778  -0.03285564 -0.2615107354 versicolor
## 83   -0.05233076 -0.81982329   0.08043967  0.0008746178 versicolor
## 84    0.18919584 -0.81982329   0.76021149  0.5256453243 versicolor
## 85   -0.53538397 -0.13153881   0.42032558  0.3944526477 versicolor
## 86    0.18919584  0.78617383   0.42032558  0.5256453243 versicolor
## 87    1.03453895  0.09788935   0.53362088  0.3944526477 versicolor
## 88    0.55148575 -1.73753594   0.36367793  0.1320672944 versicolor
## 89   -0.29385737 -0.13153881   0.19373497  0.1320672944 versicolor
## 90   -0.41462067 -1.27867961   0.13708732  0.1320672944 versicolor
## 91   -0.41462067 -1.04925145   0.36367793  0.0008746178 versicolor
## 92    0.30995914 -0.13153881   0.47697323  0.2632599711 versicolor
## 93   -0.05233076 -1.04925145   0.13708732  0.0008746178 versicolor
## 94   -1.01843718 -1.73753594  -0.25944625 -0.2615107354 versicolor
## 95   -0.29385737 -0.81982329   0.25038262  0.1320672944 versicolor
## 96   -0.17309407 -0.13153881   0.25038262  0.0008746178 versicolor
## 97   -0.17309407 -0.36096697   0.25038262  0.1320672944 versicolor
## 98    0.43072244 -0.36096697   0.30703027  0.1320672944 versicolor
## 99   -0.89767388 -1.27867961  -0.42938920 -0.1303180588 versicolor
## 100  -0.17309407 -0.59039513   0.19373497  0.1320672944 versicolor
## 101   0.55148575  0.55674567   1.27004036  1.7063794137  virginica
## 102  -0.05233076 -0.81982329   0.76021149  0.9192233541  virginica
## 103   1.51759216 -0.13153881   1.21339271  1.1816087073  virginica
## 104   0.55148575 -0.36096697   1.04344975  0.7880306775  virginica
## 105   0.79301235 -0.13153881   1.15674505  1.3128013839  virginica
## 106   2.12140867 -0.13153881   1.60992627  1.1816087073  virginica
## 107  -1.13920048 -1.27867961   0.42032558  0.6568380009  virginica
## 108   1.75911877 -0.36096697   1.43998331  0.7880306775  virginica
## 109   1.03453895 -1.27867961   1.15674505  0.7880306775  virginica
## 110   1.63835547  1.24503015   1.32668801  1.7063794137  virginica
## 111   0.79301235  0.32731751   0.76021149  1.0504160307  virginica
## 112   0.67224905 -0.81982329   0.87350679  0.9192233541  virginica
## 113   1.15530226 -0.13153881   0.98680210  1.1816087073  virginica
## 114  -0.17309407 -1.27867961   0.70356384  1.0504160307  virginica
## 115  -0.05233076 -0.59039513   0.76021149  1.5751867371  virginica
## 116   0.67224905  0.32731751   0.87350679  1.4439940605  virginica
## 117   0.79301235 -0.13153881   0.98680210  0.7880306775  virginica
## 118   2.24217198  1.70388647   1.66657392  1.3128013839  virginica
## 119   2.24217198 -1.04925145   1.77986923  1.4439940605  virginica
## 120   0.18919584 -1.96696410   0.70356384  0.3944526477  virginica
## 121   1.27606556  0.32731751   1.10009740  1.4439940605  virginica
## 122  -0.29385737 -0.59039513   0.64691619  1.0504160307  virginica
## 123   2.24217198 -0.59039513   1.66657392  1.0504160307  virginica
## 124   0.55148575 -0.81982329   0.64691619  0.7880306775  virginica
## 125   1.03453895  0.55674567   1.10009740  1.1816087073  virginica
## 126   1.63835547  0.32731751   1.27004036  0.7880306775  virginica
## 127   0.43072244 -0.59039513   0.59026853  0.7880306775  virginica
## 128   0.30995914 -0.13153881   0.64691619  0.7880306775  virginica
## 129   0.67224905 -0.59039513   1.04344975  1.1816087073  virginica
## 130   1.63835547 -0.13153881   1.15674505  0.5256453243  virginica
## 131   1.87988207 -0.59039513   1.32668801  0.9192233541  virginica
## 132   2.48369858  1.70388647   1.49663097  1.0504160307  virginica
## 133   0.67224905 -0.59039513   1.04344975  1.3128013839  virginica
## 134   0.55148575 -0.59039513   0.76021149  0.3944526477  virginica
## 135   0.30995914 -1.04925145   1.04344975  0.2632599711  virginica
## 136   2.24217198 -0.13153881   1.32668801  1.4439940605  virginica
## 137   0.55148575  0.78617383   1.04344975  1.5751867371  virginica
## 138   0.67224905  0.09788935   0.98680210  0.7880306775  virginica
## 139   0.18919584 -0.13153881   0.59026853  0.7880306775  virginica
## 140   1.27606556  0.09788935   0.93015445  1.1816087073  virginica
## 141   1.03453895  0.09788935   1.04344975  1.5751867371  virginica
## 142   1.27606556  0.09788935   0.76021149  1.4439940605  virginica
## 143  -0.05233076 -0.81982329   0.76021149  0.9192233541  virginica
## 144   1.15530226  0.32731751   1.21339271  1.4439940605  virginica
## 145   1.03453895  0.55674567   1.10009740  1.7063794137  virginica
## 146   1.03453895 -0.13153881   0.81685914  1.4439940605  virginica
## 147   0.55148575 -1.27867961   0.70356384  0.9192233541  virginica
## 148   0.79301235 -0.13153881   0.81685914  1.0504160307  virginica
## 149   0.43072244  0.78617383   0.93015445  1.4439940605  virginica
## 150   0.06843254 -0.13153881   0.76021149  0.7880306775  virginica
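
Both outputs above are z-scores. As a minimal sketch of the other normalization mentioned in this section, the 0-1 (min-max) rescaling can be written in base R. The helper min_max() and the data frame iris_01 are hypothetical names used only for this illustration; with caret, passing method = "range" to preProcess() is the corresponding option.

```r
# Min-max scaling: maps the smallest value to 0 and the largest to 1
min_max <- function(v) (v - min(v)) / (max(v) - min(v))

# Apply it to the four numeric columns of iris and keep Species as is
iris_01 <- as.data.frame(lapply(iris[, 1:4], min_max))
iris_01$Species <- iris$Species

# Each numeric column should now range from 0 to 1
sapply(iris_01[, 1:4], range)

# For comparison, z-scores from scale() have mean 0 and standard deviation 1
z <- scale(iris[, 1:4])
round(colMeans(z), 10)
apply(z, 2, sd)
```

When a validation set is involved, the minimum and maximum (or mean and standard deviation) are normally estimated on the training data and then applied to the validation data, which is the reason preProcess() separates the “recipe” from predict().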

Splitting Data into Training and Validation Sets

When evaluating the performance of a model fitted on a training dataset, we need a validation dataset that is different from the training set. This requires us to split the original data. Usually, around 60% of the original data are used as training data.

n = nrow(iris)

# Randomly select 60% of the row indices; these rows form the training data
train_idx <- sample(n, size = n * 0.6)

# Now extract the corresponding rows
train <- iris[train_idx, ]

# The remaining are used as validation data
validation <- iris[-train_idx, ]
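
Because sample() draws different rows on every run, the split above is not reproducible. A minimal sketch of a reproducible version, assuming an arbitrary seed value of 123 (any integer works), together with a few sanity checks on the result:

```r
set.seed(123)  # arbitrary seed, chosen only so that the split can be reproduced

n <- nrow(iris)
train_idx <- sample(n, size = round(0.6 * n))

train <- iris[train_idx, ]
validation <- iris[-train_idx, ]

# Sanity checks: the two sets partition the original rows
nrow(train)
nrow(validation)
length(intersect(rownames(train), rownames(validation)))  # no shared rows

# Class balance in each part (a purely random split is only roughly balanced)
table(train$Species)
table(validation$Species)
```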

Reading External Data into R

In R, we can read data from files stored outside the R environment. We can also write data to files, which are then stored and accessed through the operating system. R can read and write various file formats, such as CSV and Excel.

If the data are on the local computer, it is convenient if they are stored in the same working directory where most of your R files are stored.

You can check which directory the R workspace is pointing to using the “getwd” function. You can also set a new working directory using the “setwd” function.

getwd() # Get the working directory
## [1] "/cloud/project/MyDoc/Zhang/Stat415.515.615"
#setwd("/Users/home/Documents/Zhang/Stat415.515.615") # Set it to a new one
#getwd()

# myData is a data frame made from the raw csv data
myData = read.csv("Sales.csv") 
head(myData, n = 20) # Display only the first 20 rows of the data frame
##    Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 1            FDW58      20.750          Low Fat     0.007564836
## 2            FDW14       8.300              reg     0.038427677
## 3            NCN55      14.600          Low Fat     0.099574908
## 4            FDQ58       7.315          Low Fat     0.015388393
## 5            FDY38          NA          Regular     0.118599314
## 6            FDH56       9.800          Regular     0.063817206
## 7            FDL48      19.350          Regular     0.082601537
## 8            FDC48          NA          Low Fat     0.015782495
## 9            FDN33       6.305          Regular     0.123365446
## 10           FDA36       5.985          Low Fat     0.005698435
## 11           FDT44      16.600          Low Fat     0.103569075
## 12           FDQ56       6.590          Low Fat     0.105811470
## 13           NCC54          NA          Low Fat     0.171079215
## 14           FDU11       4.785          Low Fat     0.092737611
## 15           DRL59      16.750               LF     0.021206464
## 16           FDM24       6.135          Regular     0.079450700
## 17           FDI57      19.850          Low Fat     0.054135210
## 18           DRC12      17.850          Low Fat     0.037980963
## 19           NCM42          NA          Low Fat     0.028184344
## 20           FDA46      13.600          Low Fat     0.196897637
##                Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year
## 1            Snack Foods 107.8622            OUT049                      1999
## 2                  Dairy  87.3198            OUT017                      2007
## 3                 Others 241.7538            OUT010                      1998
## 4            Snack Foods 155.0340            OUT017                      2007
## 5                  Dairy 234.2300            OUT027                      1985
## 6  Fruits and Vegetables 117.1492            OUT046                      1997
## 7           Baking Goods  50.1034            OUT018                      2009
## 8           Baking Goods  81.0592            OUT027                      1985
## 9            Snack Foods  95.7436            OUT045                      2002
## 10          Baking Goods 186.8924            OUT017                      2007
## 11 Fruits and Vegetables 118.3466            OUT017                      2007
## 12 Fruits and Vegetables  85.3908            OUT045                      2002
## 13    Health and Hygiene 240.4196            OUT019                      1985
## 14                Breads 122.3098            OUT049                      1999
## 15           Hard Drinks  52.0298            OUT013                      1987
## 16          Baking Goods 151.6366            OUT049                      1999
## 17               Seafood 198.7768            OUT045                      2002
## 18           Soft Drinks 192.2188            OUT018                      2009
## 19             Household 109.6912            OUT027                      1985
## 20           Snack Foods 193.7136            OUT010                      1998
##    Outlet_Size Outlet_Location_Type       Outlet_Type
## 1       Medium               Tier 1 Supermarket Type1
## 2                            Tier 2 Supermarket Type1
## 3                            Tier 3     Grocery Store
## 4                            Tier 2 Supermarket Type1
## 5       Medium               Tier 3 Supermarket Type3
## 6        Small               Tier 1 Supermarket Type1
## 7       Medium               Tier 3 Supermarket Type2
## 8       Medium               Tier 3 Supermarket Type3
## 9                            Tier 2 Supermarket Type1
## 10                           Tier 2 Supermarket Type1
## 11                           Tier 2 Supermarket Type1
## 12                           Tier 2 Supermarket Type1
## 13       Small               Tier 1     Grocery Store
## 14      Medium               Tier 1 Supermarket Type1
## 15        High               Tier 3 Supermarket Type1
## 16      Medium               Tier 1 Supermarket Type1
## 17                           Tier 2 Supermarket Type1
## 18      Medium               Tier 3 Supermarket Type2
## 19      Medium               Tier 3 Supermarket Type3
## 20                           Tier 3     Grocery Store
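
The example above only reads a file. As a minimal sketch of the writing direction mentioned at the start of this section, the code below writes a small data frame to a CSV file and reads it back. It uses a temporary file created with tempfile() so that it runs without the Sales.csv file; the names tmp and iris_back are introduced only for this illustration.

```r
# A temporary file path, so nothing is written into the working directory
tmp <- tempfile(fileext = ".csv")

# Write the first six rows of iris; row.names = FALSE avoids an extra index column
write.csv(head(iris), file = tmp, row.names = FALSE)

# Read the file back; stringsAsFactors = TRUE turns text columns into factors
iris_back <- read.csv(tmp, stringsAsFactors = TRUE)

str(iris_back)          # column names and types after the round trip
file.exists(tmp)        # the file lives outside the R session

unlink(tmp)             # clean up the temporary file
```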

Data Visualization Using the “ggplot2” Package

A cheatsheet of ggplot2: https://rstudio.com/wp-content/uploads/2015/04/ggplot2-cheatsheet.pdf

You will need to practice the examples on your own so that class time can be saved for other topics!

The structure of the code that creates a ggplot object in R is:

ggplot(data = ?, mapping = aes(x = ?, y = ?, color = ?, fill = ?, …)) + geom_*()

where * can be “line”, “point”, “bar”, “boxplot”, “density”, “dotplot”, “histogram”, “hline”, “vline”, “segment”, “text”, “smooth”, …

y = c(34, 56, 61, 78, 84, 92, 100, 120, 125)
x = c("1990-05-01", "1990-05-02", "1990-05-03", "1990-05-04", "1990-05-05", "1990-05-06", "1990-05-07", "1990-05-08", "1990-05-09")
dx = as.Date(x)

df = data.frame(x, y, dx)

library(ggplot2)

# Line plots
ggplot(data = df, mapping = aes(x=dx, y=y)) +
  geom_line(linewidth = 3, color = "yellow", linetype = "solid") + # `linewidth` replaces `size` for lines as of ggplot2 3.4.0
  scale_y_continuous(limits = c(40, 150)) + # Restrict the y-scale; values outside the limits are dropped
  labs(title = "Some Line Plot", x = "Date", y = "Price", subtitle = "A First Example on ggplot", caption = "Made by XYZ") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, color = "red", face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 10, color = "blue", face = "italic"),
        plot.caption = element_text(hjust = 1, size = 8, color = "green"),
        axis.text.x=element_text(size=12, color = "red"),
        axis.text.y=element_text(size=25, color = "pink", face = "bold"),
        axis.title.x=element_text(size=14,face="bold"),
        axis.title.y=element_text(size=30,face="bold"),
        plot.background = element_rect(fill = "lightyellow",
                                colour = "lightblue",
                                linewidth = 0.5, linetype = "dashed"),
        panel.background = element_rect(fill = "lightgray",
                                colour = "lightblue",
                                linewidth = 0.5, linetype = "solid"),
        panel.grid.major = element_line(linewidth = 0.5, linetype = 'solid',
                                colour = "purple"), 
        panel.grid.minor = element_line(linewidth = 0.25, linetype = 'dotted',
                                colour = "violet")
  )
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

# Barplots
ggplot(data = iris, mapping = aes(x = Species, fill = Species)) + 
  geom_bar() +
  theme(legend.position = "none")

# Density plot
ggplot(iris, aes(x = Sepal.Length, color = Species)) + 
  geom_density() +
  labs(color = "Species") + 
  coord_flip()

ggplot(mtcars, aes(x = mpg, color = factor(cyl))) + 
  geom_density() +
  labs(color = "Cyl") + 
  coord_flip()

# Violin plot
ggplot(iris, aes(y = Sepal.Length, x = Species)) + 
  geom_violin() +
  labs(y = "Sepal Length") # Sepal.Length is mapped to the y-axis

titanic = as.data.frame(Titanic)

ggplot(titanic, aes(x = Survived, y = Freq, fill = Class)) +
  geom_col(position = "dodge") # For aggregated (grouped) data; equivalent to geom_bar(stat = "identity", position = "dodge")

ggplot(titanic, aes(x = Survived, y = Freq, fill = Class)) +
  geom_col(position = "stack") +
  labs(fill = "Classes")

# Faceting
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(cyl ~ .)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(. ~ cyl) # The "." on the left side can be omitted

# When there are many categories for the categorical variable
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(. ~ manufacturer)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(cyl ~ class)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(cyl ~ class)

# Making your plot interactive using the "plotly" package
# First store the ggplot object in a variable, called p here.
p = ggplot(mpg, aes(displ, hwy)) +
  geom_point()

library(plotly) # Must load the package first
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:igraph':
## 
##     groups
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ggplotly(p)
# Scatterplot matrices: base R first, then a better version using the "GGally" package
plot(mtcars) # scatterplot matrix

cor(mtcars) # correlation matrix
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
# Two in one
library(GGally)
mtcars %>% GGally::ggpairs()

Create a choropleth map

A choropleth map (in Greek, ‘choro’ means ‘area/region’ and ‘pletho’ means ‘multitude’) is a type of thematic map in which a set of pre-defined areas is colored or patterned in proportion to a statistical variable that summarizes a geographic characteristic within each area, such as population density or per-capita income. (Source: Wikipedia)

Violent Crime Rates by US State

The following code is taken from R package documentation.

library(dplyr)
library(highcharter) 

data("USArrests", package = "datasets")
data("usgeojson")

USArrests$state <- rownames(USArrests)
# Alternatively, USArrests <- mutate(USArrests, state = rownames(USArrests))

highchart() %>%
  hc_title(text = "Violent Crime Rates by US State") %>%
  hc_subtitle(text = "Source: USArrests data") %>%
  hc_add_series_map(
    map = usgeojson, 
    df = USArrests,
    name = "Murder arrests (per 100,000)",
    value = "Murder", joinBy = c("woename", "state"),
    dataLabels = list(
      enabled = TRUE,
      format = "{point.properties.postalcode}"
    )
  ) %>%
  hc_colorAxis(stops = color_stops()) %>%
  hc_legend(valueDecimals = 0) %>% # Values are rates per 100,000, so no "%" suffix
  hc_mapNavigation(enabled = TRUE)

Create Heat Maps

A heat map (or heatmap) is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space. (Source: Wikipedia)

Refer to: https://www.r-graph-gallery.com/215-the-heatmap-function.html

Heatmaps that display raw values or values after scaling

# The mtcars dataset:
data <- as.matrix(mtcars) # The heatmap function takes as input a matrix.

# Default Heatmap
heatmap(data)

The heatmap just generated is not very informative: the “hp” and “disp” variables have much larger values than the others and absorb nearly all of the color variation. We need to normalize the data.
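One quick way to see how far apart the column magnitudes are is to compare the range of each variable. The sketch below uses only base R and the built-in mtcars data; the name `col_ranges` is just an illustrative choice.

```r
# Minimum and maximum of every column in mtcars
col_ranges <- sapply(mtcars, range)
rownames(col_ranges) <- c("min", "max")
round(col_ranges, 1)

# The two columns with the largest maximum values
top_two <- names(sort(col_ranges["max", ], decreasing = TRUE))[1:2]
top_two
```

Columns whose maxima are in the hundreds sit next to columns that never exceed 1, which is why some form of scaling is needed before the colors become comparable.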

Normalization is controlled by the scale argument of the heatmap() function, which can standardize either by row or by column. Here the column option is chosen, because the variables (columns) are measured on very different scales.

# Use 'scale' to normalize
heatmap(data, scale="column")

# No dendrogram and no reordering for either columns or rows
heatmap(data, Colv = NA, Rowv = NA, scale="column")

Note: to standardize the columns of a dataset directly, use the scale() function in R, as shown below.

scaled.mtcars = scale(mtcars)

# Check
apply(scaled.mtcars, 2, mean) # All column means are now basically zero
##           mpg           cyl          disp            hp          drat 
##  7.112366e-17 -1.474515e-17 -9.084937e-17  1.040834e-17 -2.918672e-16 
##            wt          qsec            vs            am          gear 
##  4.681043e-17  5.299580e-16  6.938894e-18  4.510281e-17 -3.469447e-18 
##          carb 
##  3.165870e-17
apply(scaled.mtcars, 2, sd) # All column standard deviations are now one
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    1    1    1    1    1    1    1    1    1    1    1
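To make explicit what scale() computes with its default arguments, the following sketch standardizes each column by hand (subtract the column mean, divide by the column standard deviation) and compares the result with scale(). The name `manual` is illustrative only.

```r
# Standardize each column by hand: (value - column mean) / column sd
manual <- sapply(mtcars, function(v) (v - mean(v)) / sd(v))

# scale() returns a matrix carrying extra attributes, so compare values only
same <- all.equal(manual, scale(mtcars), check.attributes = FALSE)
same
```

If the two agree, `same` is TRUE, confirming that the default scale() call is a column-wise z-score.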

Heatmaps that display correlations and missing values

For correlation,

# No dendrogram at all
heatmap(cor(mtcars), Colv = NA, Rowv = NA, scale = "none") # Correlations are already on a common scale; heatmap() scales rows by default, so turn scaling off

library(gplots)
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
heatmap.2(
  cor(mtcars),
  Colv = FALSE, Rowv = FALSE,          # no reordering
  cellnote = round(cor(mtcars), 2),    # print the correlation in each cell
  notecol = "black",
  dendrogram = "none", trace = "none", key = FALSE
)
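A correlation heatmap is often read alongside a short list of the strongest pairs. The sketch below extracts them from the same matrix using base R only; the 0.85 cutoff and the names `r`, `idx`, and `strong_pairs` are arbitrary choices made for illustration.

```r
r <- cor(mtcars)

# Look only at the upper triangle so each pair of variables appears once
idx <- which(upper.tri(r) & abs(r) > 0.85, arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = round(r[idx], 2)
)

# Strongest relationships first
strong_pairs <- strong_pairs[order(-abs(strong_pairs$r)), ]
strong_pairs
```

Lowering the cutoff lists more pairs; the sign of `r` tells whether the two variables move together or in opposite directions.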

For missing values,

w = c(NA, 7, NA, 34, 12, NA, 44, 21, 26, 56, NA, 45, 34, 12)
x = c( 9, 12, 23, 31, NA, 24, 13, 26, NA, 43, NA, NA, 34, NA)
y = c(32, 11, 7, NA, 8, 2, 3, NA, 2, 8, 12, 21, 54, NA)
z = c(41,  23, NA, 51, 52, 43, NA, 31, NA, 34, 31, NA, 33, NA)
df = data.frame(w, x, y, z)

# What is the difference between the results of the following 2 lines of code?
is.na(x)
##  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE
## [13] FALSE  TRUE
is.na(x)*1
##  [1] 0 0 0 0 1 0 0 0 1 0 1 1 0 1
# Generate a heatmap that displays missing values
heatmap(is.na(df)*1, Colv = NA, Rowv = NA)
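The picture of missing values can be backed up with simple counts. This sketch rebuilds the same data frame so that it runs on its own, then counts the missing values per column and the rows with no missing value; the names `na_per_column` and `n_complete` are illustrative.

```r
w <- c(NA, 7, NA, 34, 12, NA, 44, 21, 26, 56, NA, 45, 34, 12)
x <- c( 9, 12, 23, 31, NA, 24, 13, 26, NA, 43, NA, NA, 34, NA)
y <- c(32, 11,  7, NA,  8,  2,  3, NA,  2,  8, 12, 21, 54, NA)
z <- c(41, 23, NA, 51, 52, 43, NA, 31, NA, 34, 31, NA, 33, NA)
df <- data.frame(w, x, y, z)

# Number of missing values in each column
na_per_column <- colSums(is.na(df))
na_per_column

# Number of rows that have no missing value at all
n_complete <- sum(complete.cases(df))
n_complete
```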

Create Treemaps

A treemap is a space-filling visualization of hierarchical structures. The map is a set of nested rectangles. Each group is represented by a rectangle.

Here is a blog about treemaps: https://www.r-bloggers.com/2018/09/simple-steps-to-create-treemap-in-r/

# The code is from R documentation.

library(treemap)

data(GNI2014)

treemap(
  dtf = GNI2014,
  index=c("continent", "iso3"),
  vSize="population",
  vColor="GNI",
  type="value",
  format.legend = list(scientific = FALSE, big.mark = " "))
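In a treemap, the area of each rectangle is proportional to the size variable, and a group's rectangle covers the sum of its members. The sketch below shows that aggregation step in base R on a small made-up data frame; `toy` and its values are hypothetical and are not part of GNI2014.

```r
# Hypothetical two-level hierarchy with a size for every leaf
toy <- data.frame(
  group    = c("A", "A", "B", "B", "B"),
  subgroup = c("a1", "a2", "b1", "b2", "b3"),
  size     = c(10, 30, 20, 25, 15)
)

# Total size per top-level group, and its share of the whole map
group_total <- aggregate(size ~ group, data = toy, FUN = sum)
group_total$share <- group_total$size / sum(group_total$size)
group_total
```

With these numbers, group A would take up 40% of the map and group B 60%, with each group's rectangle subdivided among its subgroups in the same proportional way.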