Introduction to R

knitr::opts_chunk$set(echo = TRUE)

library(ggplot2) # it's a good idea to load your packages in this code chunk
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(highcharter)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

About R

R is an interpreted language. When you enter expressions into the R console (or run an R script in batch mode), a program within the R system, called the interpreter, executes the actual code that you wrote. Unlike C, C++, and Java, there is no need to compile your programs into an object language. Other examples of interpreted languages are Common Lisp, Perl, and JavaScript. (R in a Nutshell, 2nd Edition, by Joseph Adler)

A good reference for the R programming language is https://www.tutorialspoint.com/r/index.htm.

Load R Packages

When running code written in R, some packages might be needed. These packages must first be installed in one of two ways:

Install from the console by issuing

install.packages("the package name in double or single quotes")

Or go to the lower-right pane of the RStudio window, click the “Packages” tab and then the “Install” button, type the name of the package you want to install, and click “Install”. The console will show the progress of the installation.

A package needs to be installed only once. To remove a package from your computer, go to the lower-right pane again, find the package in the list of packages, and click the “x” at the right end of its row.

When your R code uses a function or a dataset from a particular package, you need to load the package by issuing

library("the package name with or without quotes")
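As a minimal sketch of the install-then-load steps above, the check below installs a package only when it is missing and then loads it. The variable name pkg is illustrative, and "stats" is used only because it ships with R, so nothing is actually downloaded here.

```r
# Hypothetical example: substitute the package you need for "stats".
pkg <- "stats"

# requireNamespace() returns FALSE when the package is not installed.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE tells library() that pkg holds the name as a string.
library(pkg, character.only = TRUE)
```

Writing the check this way means the script can be rerun without reinstalling the package each time.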

print("Please install the 'igraph' package.")

## [1] "Please install the 'igraph' package."

library(igraph) # Load the package "igraph"

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

# A graph with directed edges: 1->2 2->4 3->1 2->1 3->2 4->1 4->3
g1 <- graph(edges=c(1,2, 2,4, 3,1, 2,1, 3,2, 4,1, 4,3), n=4, directed=TRUE)
plot(g1) # A plot of the network

Code Chunk

To run R code or other code within the RStudio environment, the code must be in a code chunk. An R code chunk looks like

# Your many lines of code are in between
# x=3
# y=4

To add a code chunk, a shortcut is to click the “Insert” button at the top of the upper-left pane (the source editor) and then choose the appropriate programming language.

Using Comments in Your Code

Comments help both you and your readers understand your code. A comment starts with one or more #’s; anything after a # on the same line is treated as a comment. To comment out multiple lines of code in RStudio, highlight those lines and press Cmd+Shift+C on a Mac (Ctrl+Shift+C on Windows or Linux).
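A short illustration of the comment styles described above may help. The variable names radius and area are made up for this sketch.

```r
# A full-line comment: compute the area of a circle.
radius <- 2              # a trailing comment after code
area <- pi * radius^2    # everything after # is ignored by R
# area <- 0              # a commented-out line of code does not run
print(area)
```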

Defining Basic R Objects

Objects are the instances of classes. Everything in R is an object of a certain class. Each class has a certain structure. Basic data structures in R include vectors, matrices, data frames, lists, and factors.

Vectors in R

Vectors are one-dimensional arrays. All elements of a vector must have the same type, for example all numeric values or all character strings. If one element of a vector is a string, the other elements are converted to strings automatically.
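The automatic conversion can be seen in a small sketch like the one below. The vector names are illustrative only; the point is that c() picks one common type for all of its elements.

```r
# Logical values mixed with numbers are converted to numbers (TRUE becomes 1).
mixed_lgl_num <- c(TRUE, 2, 3.5)
class(mixed_lgl_num)   # "numeric"

# As soon as one element is a string, every element becomes a string.
mixed_num_chr <- c(1, "a", TRUE)
class(mixed_num_chr)   # "character"
print(mixed_num_chr)   # "1" "a" "TRUE"
```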

x = 4 # This defines a scalar, which is a numeric vector of length 1
print(x) # This prints x. At the console, typing x alone has the same effect.

## [1] 4

y = c(2, 5, 9, 10) # This defines a numeric vector of length 4. The 4 elements are 2, 5, 9, and 10.
z = 1:10 # This defines a patterned numeric vector of length 10. The elements are 1, 2, ..., 10.
t = seq(3, 100, by = 10) # An arithmetic sequence (vector) with initial term 3 and increment 10.
a = y^2 # This defines a numeric vector whose elements are the squares of the elements of y.
b = log(y) # Natural log-transformation of y to b
u = "Hello World!" # This defines a character vector of length 1.
v = c("David", "Mike", "Rich") # This defines a character vector of length 3.
w = c("Haha", "Hehe", 5, 10) # The elements 5 and 10 will be converted to strings automatically.
print(w)

## [1] "Haha" "Hehe" "5"    "10"

class(y) # This shows the class of the R object y.

## [1] "numeric"

class(w)

## [1] "character"

Matrices and Data Frames in R

Matrices are 2-dimensional arrays. All elements of a matrix must have the same type (for example, all numeric or all character), whereas a data frame can hold numeric values in some columns and characters in others.
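The contrast can be sketched as follows. The object names are hypothetical, and stringsAsFactors = FALSE is spelled out so the result does not depend on the R version.

```r
# A numeric matrix: every element has the same type.
num_mat <- matrix(c(1, 2, 3, 4), nrow = 2)

# Binding a character column onto it converts the whole matrix to character.
chr_mat <- cbind(num_mat, c("a", "b"))
typeof(chr_mat)   # "character"

# A data frame keeps a separate type for each column.
df <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
col_classes <- sapply(df, class)
print(col_classes)
```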

M = matrix(1:20, nrow = 4, byrow = TRUE) # This defines a matrix of dimension 4 by 5, with elements 1 through 20 filled in row by row.
M

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
## [4,]   16   17   18   19   20

dim(M) # This displays the dimension of matrix M.

## [1] 4 5

rownames(M) = c("Row1", "Row2", "Row3", "Row4")
colnames(M) = c("Col1", "Col2", "Col3", "Col4", "Col5")
dimnames(M) # This displays both row names and column names

## [[1]]
## [1] "Row1" "Row2" "Row3" "Row4"
## 
## [[2]]
## [1] "Col1" "Col2" "Col3" "Col4" "Col5"

M # Display M with its new row and column names

##      Col1 Col2 Col3 Col4 Col5
## Row1    1    2    3    4    5
## Row2    6    7    8    9   10
## Row3   11   12   13   14   15
## Row4   16   17   18   19   20

D = data.frame(y, a, b, grade = c("A", "B", "B+", "A-")) # This defines a data frame.
dimnames(D) # This gives both row names and column names of the data frame D

## [[1]]
## [1] "1" "2" "3" "4"
## 
## [[2]]
## [1] "y"     "a"     "b"     "grade"

rownames(D) = c("Jenny", "Henny", "Bob", "Tod") # Change row names of D
colnames(D) = c("Y", "A", "B", "Grade") # Change column names. Equivalently, you can use the function "names".

D

##        Y   A         B Grade
## Jenny  2   4 0.6931472     A
## Henny  5  25 1.6094379     B
## Bob    9  81 2.1972246    B+
## Tod   10 100 2.3025851    A-

class(M)

## [1] "matrix" "array"

class(D)

## [1] "data.frame"

Lists in R

A list in R can hold elements of different kinds, including vectors, matrices, data frames, and other lists. Lists matter in practice because model-fitting functions usually return their results as lists.
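As one illustration of model output stored as a list, the sketch below fits a simple linear regression on the built-in mtcars data. The choice of model (mpg against wt) and the name fit are assumptions made only for this example.

```r
# Fit a simple linear model; lm() returns an object built on a list.
fit <- lm(mpg ~ wt, data = mtcars)

is.list(fit)   # TRUE: the fitted model is stored as a list
names(fit)     # The components of the list

# Components are extracted like elements of any other list.
fit$coefficients
fit[["residuals"]][1:3]
```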

myList = list(A=1:5, B = matrix(1:8, nrow = 2, byrow = TRUE), C = "Hello!", D = data.frame(x=1:4, y = 9:6))
print(myList)

## $A
## [1] 1 2 3 4 5
## 
## $B
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## 
## $C
## [1] "Hello!"
## 
## $D
##   x y
## 1 1 9
## 2 2 8
## 3 3 7
## 4 4 6

class(myList)

## [1] "list"

Factors in R

To encode a vector as a factor with certain levels, use the R function “factor”. The levels should cover the values in the vector; any value that is not among the levels becomes NA. Labels are optional. When a vector is converted to a factor, it becomes a categorical variable. Factors are useful when you want to display character vectors in a non-alphabetical order (that is, an order you choose).
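A small sketch of the ordering behavior, using a made-up character vector of sizes: sorting the plain strings is alphabetical, while sorting the factor follows the order of its levels.

```r
sizes <- c("medium", "small", "large", "small")

# Alphabetical order for a plain character vector.
sort(sizes)

# Levels given in the order we want them displayed.
size_factor <- factor(sizes, levels = c("small", "medium", "large"))
sorted_sizes <- as.character(sort(size_factor))
print(sorted_sizes)
table(size_factor)   # Counts are reported in level order

# A value that is not among the levels becomes NA.
partial <- factor(c("small", "huge"), levels = c("small", "medium", "large"))
is.na(partial)
```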

v = c(1, 1, 3, 0, 1, 0, 3, 4, 4, 1, 2, 0, 1, 2)
# The following encodes the vector v as a factor with levels 0 through 4.
x = factor(v, levels = c(0, 1, 2, 3, 4), labels = c("zero", "one", "two", "three", "four")) 
y = factor(v) # By default, the levels are the distinct values in their natural (sorted) order
z = factor(v, levels = 4:0) # The order of levels can be set as desired.

levels(x) # This displays the levels of the factor x

## [1] "zero"  "one"   "two"   "three" "four"

levels(z)

## [1] "4" "3" "2" "1" "0"

class(v)

## [1] "numeric"

class(x)

## [1] "factor"

Subsetting in R

You can extract part of a data structure with subsetting operations. Three operators extract subsets of R objects:

[: returns an object of the same class as the original and can select several elements,
[[: extracts a single element from a list or data frame, and
$: extracts a single element from a list or data frame by name.
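The difference between the operators shows up in the class of the result. The sketch below uses hypothetical objects lst and m; the drop = FALSE argument is included as one way to keep a matrix from collapsing to a vector.

```r
lst <- list(a = 1:3, b = "text")

class(lst["a"])     # "list": single brackets keep the list class
class(lst[["a"]])   # "integer": double brackets return the element itself
class(lst$a)        # "integer": same as lst[["a"]]

m <- matrix(1:6, nrow = 2)
row_as_vector <- m[1, ]                 # Dimensions are dropped
row_as_matrix <- m[1, , drop = FALSE]   # Stays a 1 x 3 matrix
dim(row_as_matrix)
```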

x = (1:8)/10
x[3] # Extract the third element from vector x

## [1] 0.3

x[2:5] # Extract 2nd to 5th elements as a vector

## [1] 0.2 0.3 0.4 0.5

M = matrix(1:35, nrow = 5, byrow = TRUE)

M[4] # Extract the 4th element of M (elements are counted down the columns)

## [1] 22

M[4, ] # Extract the 4th row as a vector

## [1] 22 23 24 25 26 27 28

M[, 4] # Extract the 4th column as a vector

## [1]  4 11 18 25 32

M[2, 5] # Extract the element at the intersection of second row and 5th column.

## [1] 12

M[2:4, ] # Extract the second to 4th rows

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    8    9   10   11   12   13   14
## [2,]   15   16   17   18   19   20   21
## [3,]   22   23   24   25   26   27   28

M[, 2:4] # Extract the second to 4th columns

##      [,1] [,2] [,3]
## [1,]    2    3    4
## [2,]    9   10   11
## [3,]   16   17   18
## [4,]   23   24   25
## [5,]   30   31   32

y = data.frame(a=1:10, b = 5:14)

y[2] # Extract the second column and make it a new data frame (not useful).

##     b
## 1   5
## 2   6
## 3   7
## 4   8
## 5   9
## 6  10
## 7  11
## 8  12
## 9  13
## 10 14

y[[2]] # Extract the second column as a vector.

##  [1]  5  6  7  8  9 10 11 12 13 14

y[2, ] # Extract the second row as a one-row data frame.

##   a b
## 2 2 6

y[, 2] # Same as y[[2]]

##  [1]  5  6  7  8  9 10 11 12 13 14

y$b # Extract the "b" column as a vector. Dollar is good, but not necessary!

##  [1]  5  6  7  8  9 10 11 12 13 14

y$"b" # Same as y$b

##  [1]  5  6  7  8  9 10 11 12 13 14

y["b"] # Same as y[2], a new data frame (not useful).

##     b
## 1   5
## 2   6
## 3   7
## 4   8
## 5   9
## 6  10
## 7  11
## 8  12
## 9  13
## 10 14

y[,"b"] # Same as y[, 2]

##  [1]  5  6  7  8  9 10 11 12 13 14

y[["b"]] # Same as y[[2]], a vector

##  [1]  5  6  7  8  9 10 11 12 13 14

y[3:6, ] # Extract the 3rd to 6th rows as a new data frame

##   a  b
## 3 3  7
## 4 4  8
## 5 5  9
## 6 6 10

myList = list(A=1:5, B = matrix(1:8, nrow = 2, byrow = TRUE), C = "Hello!", D = data.frame(x=1:4, y = 9:6))
myList[4] # Extract the 4th element as a new list with only one element D.

## $D
##   x y
## 1 1 9
## 2 2 8
## 3 3 7
## 4 4 6

myList[[4]] # Not a list any more

##   x y
## 1 1 9
## 2 2 8
## 3 3 7
## 4 4 6

myList$D # Same as myList[[4]]

##   x y
## 1 1 9
## 2 2 8
## 3 3 7
## 4 4 6

Logical Expressions in R

Frequently, we need to compare two R objects to see whether they are equal or whether one is greater than the other.
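Comparisons work element by element on vectors, and the resulting logical vectors can be combined and summarized. The sketch below uses a made-up vector of scores and a pass mark of 60 chosen only for illustration.

```r
scores <- c(55, 72, 90, 41)

passed <- scores >= 60             # Element-wise comparison
mid_range <- passed & scores < 80  # & combines two conditions element-wise

any(passed)     # Is at least one element TRUE?
all(passed)     # Are all elements TRUE?
sum(passed)     # TRUE counts as 1, so this counts the TRUE values
which(passed)   # Positions of the TRUE values
```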

x = 4
y = 5
z = (x > y)
print(z)

## [1] FALSE

w = (x <= y)
print(w)

## [1] TRUE

a = "abc"
b = "abC"
d = (a != b) # Is a not equal to b?
print(d)

## [1] TRUE

q = c(2, 9, 11, 45, 34, 8, 24, 15, 5, 7, 21)
r = 5
s = (q>r)
print(s)

##  [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

u = TRUE
v = "TRUE"
class(u)

## [1] "logical"

class(v)

## [1] "character"

D2 = mtcars[mtcars$cyl == 4, c(2, 7, 9)] # A subset of mtcars for which cyl = 4.
D2

##                cyl  qsec am
## Datsun 710       4 18.61  1
## Merc 240D        4 20.00  0
## Merc 230         4 22.90  0
## Fiat 128         4 19.47  1
## Honda Civic      4 18.52  1
## Toyota Corolla   4 19.90  1
## Toyota Corona    4 20.01  0
## Fiat X1-9        4 18.90  1
## Porsche 914-2    4 16.70  1
## Lotus Europa     4 16.90  1
## Volvo 142E       4 18.60  1

t = q[q>10] # a vector containing values that are greater than 10
print(t)

## [1] 11 45 34 24 15 21

Conditional Branching in R

x = 15

# 4 branches: The number line is divided into 4 intervals: 
# (-infinity, 10), [10, 20), [20, 30), and [30, infinity)
if (x < 10){
  y = 2*x - 3
} else if (x < 20){
  y = 3*x + 4
} else if (x < 30){
  y = 5*x - 12
} else{
  y = 10000
}

print(y)

## [1] 49

# 2 branches
States = c("MN", "FL", "IL", "CA")
state = "IL"
if (state %in% States){
  message = "Found it!"
} else{
  message = "Not found."
}

print(message)

## [1] "Found it!"
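An if statement tests a single condition at a time. When the same branching rule has to be applied to every element of a vector, one option is the vectorized function ifelse. The sketch below reuses the four-interval rule from the first example; the names x_values and y_values are illustrative.

```r
x_values <- c(5, 15, 25, 35)

# Nested ifelse() calls mirror the if / else if / else chain, element by element.
y_values <- ifelse(x_values < 10, 2 * x_values - 3,
            ifelse(x_values < 20, 3 * x_values + 4,
            ifelse(x_values < 30, 5 * x_values - 12, 10000)))

print(y_values)   # 7 49 113 10000
```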

Looping in R

A loop performs the same operation repeatedly. Like many other programming languages, R has for loops and while loops.

## The following gives a way of calculating the sum of the first 100 natural numbers.
total = 0 # Initial value is 0; the name "total" avoids masking the built-in function sum()
for (k in 1:100){
  total = total + k
}

print(total)

## [1] 5050

## The following gives another way of calculating the sum of the first 100 natural numbers.
total = 0
k = 1
while (k <= 100){
  total = total + k
  k = k + 1
}

print(total)

## [1] 5050

## Or, we simply call a function to do the job
sum(1:100)

## [1] 5050
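A for loop can also walk over the positions of an existing vector. The sketch below, with made-up values, builds a running total and stores each intermediate result; the built-in cumsum gives the same answer and serves as a check.

```r
values <- c(3, 8, 1, 6)

# Allocate the result vector once, then fill it inside the loop.
running_total <- numeric(length(values))
total <- 0
for (i in seq_along(values)) {
  total <- total + values[i]
  running_total[i] <- total
}

print(running_total)   # 3 11 12 18
cumsum(values)         # The built-in equivalent
```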

Functions in R

A function in R is an object so the R interpreter is able to pass control to the function, along with arguments that may be necessary for the function to accomplish the actions. The function in turn performs its task and returns control to the interpreter as well as any result which may be stored in other objects. (From https://www.tutorialspoint.com/r/r_functions.htm)
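To make the idea of passing arguments and returning a result concrete, here is a minimal sketch of a user-defined function. The function rescale and its arguments lower and upper are hypothetical names invented for this example, and it assumes the input contains at least two distinct values.

```r
# Map the values of x linearly onto the interval [lower, upper].
rescale <- function(x, lower = 0, upper = 1) {
  rng <- range(x)   # Smallest and largest value of x
  # The value of the last expression is what the function returns.
  lower + (x - rng[1]) / (rng[2] - rng[1]) * (upper - lower)
}

rescale(c(2, 4, 6))              # Uses the default arguments: 0.0 0.5 1.0
rescale(c(2, 4, 6), upper = 10)  # Overrides one default: 0 5 10
```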

Built-in Functions in R

Many built-in functions are available in R, and we have already used quite a few of them above. To check the details of a built-in function in R, type ?functionName in the R console.
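A few ways of looking up a function are sketched below. The help calls are left as comments because they open a help page in an interactive session; the function sd is chosen only as an example.

```r
# ?sd            # Opens the help page for sd() in an interactive session
# help("sd")     # Equivalent to ?sd

# The argument names of a function can be inspected without opening help.
sd_args <- names(formals(sd))
print(sd_args)   # "x" "na.rm"

args(sd)         # Prints the function header
```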

A few very useful built-in functions are demonstrated below.

# Create a vector of 10 zeros
x = numeric(10)
x

##  [1] 0 0 0 0 0 0 0 0 0 0

x[4] = 1000 # Set the fourth element of the numeric vector to 1000
x

##  [1]    0    0    0 1000    0    0    0    0    0    0

# Create a vector of 10 empty strings
y = character(10)

print(mtcars) # "mtcars" is a data frame available from the "datasets" package

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

head(mtcars, n = 10) # Display only the first 10 rows of the data

##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

nrow(mtcars) # Display the number of rows in the data

## [1] 32

names(mtcars) # Display the column names of the data

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

colnames(mtcars) # column names

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

rownames(mtcars) # row names

##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"

dimnames(mtcars) # Both row and column names

## [[1]]
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"         
## 
## [[2]]
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

str(mtcars) # Display the structure of the mtcars data frame

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

class(mtcars) # The class of the data

## [1] "data.frame"

summary(mtcars) # Summarize each column of the mtcars data frame

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

plot(mtcars) # Scatterplot matrix

# Add a new column to a data frame
D = mtcars # Create a copy
D$log.mpg = log(D$mpg)
D$sq.wt = D$wt^2
D

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
##                      log.mpg     sq.wt
## Mazda RX4           3.044522  6.864400
## Mazda RX4 Wag       3.044522  8.265625
## Datsun 710          3.126761  5.382400
## Hornet 4 Drive      3.063391 10.336225
## Hornet Sportabout   2.928524 11.833600
## Valiant             2.895912 11.971600
## Duster 360          2.660260 12.744900
## Merc 240D           3.194583 10.176100
## Merc 230            3.126761  9.922500
## Merc 280            2.954910 11.833600
## Merc 280C           2.879198 11.833600
## Merc 450SE          2.797281 16.564900
## Merc 450SL          2.850707 13.912900
## Merc 450SLC         2.721295 14.288400
## Cadillac Fleetwood  2.341806 27.562500
## Lincoln Continental 2.341806 29.419776
## Chrysler Imperial   2.687847 28.569025
## Fiat 128            3.478158  4.840000
## Honda Civic         3.414443  2.608225
## Toyota Corolla      3.523415  3.367225
## Toyota Corona       3.068053  6.076225
## Dodge Challenger    2.740840 12.390400
## AMC Javelin         2.721295 11.799225
## Camaro Z28          2.587764 14.745600
## Pontiac Firebird    2.954910 14.784025
## Fiat X1-9           3.306887  3.744225
## Porsche 914-2       3.258097  4.579600
## Lotus Europa        3.414443  2.289169
## Ford Pantera L      2.760010 10.048900
## Ferrari Dino        2.980619  7.672900
## Maserati Bora       2.708050 12.744900
## Volvo 142E          3.063391  7.728400

# Equivalently
library(dplyr)
D = mutate(mtcars, 
           log.mpg = log(mpg),
           sq.wt = wt^2
          )
D

##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb  log.mpg     sq.wt
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 3.044522  6.864400
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 3.044522  8.265625
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 3.126761  5.382400
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 3.063391 10.336225
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 2.928524 11.833600
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1 2.895912 11.971600
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 2.660260 12.744900
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 3.194583 10.176100
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2 3.126761  9.922500
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4 2.954910 11.833600
## 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4 2.879198 11.833600
## 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3 2.797281 16.564900
## 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3 2.850707 13.912900
## 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3 2.721295 14.288400
## 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 2.341806 27.562500
## 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 2.341806 29.419776
## 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4 2.687847 28.569025
## 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1 3.478158  4.840000
## 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2 3.414443  2.608225
## 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 3.523415  3.367225
## 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1 3.068053  6.076225
## 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2 2.740840 12.390400
## 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2 2.721295 11.799225
## 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4 2.587764 14.745600
## 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2 2.954910 14.784025
## 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1 3.306887  3.744225
## 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2 3.258097  4.579600
## 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2 3.414443  2.289169
## 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4 2.760010 10.048900
## 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6 2.980619  7.672900
## 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8 2.708050 12.744900
## 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2 3.063391  7.728400

# Rename columns of a data frame
D = mtcars
D = rename(D,
       "mile per gallon" = mpg,
       "cylinder" = cyl,
       "horse power" = hp
      )
D

##                     mile per gallon cylinder  disp horse power drat    wt  qsec
## Mazda RX4                      21.0        6 160.0         110 3.90 2.620 16.46
## Mazda RX4 Wag                  21.0        6 160.0         110 3.90 2.875 17.02
## Datsun 710                     22.8        4 108.0          93 3.85 2.320 18.61
## Hornet 4 Drive                 21.4        6 258.0         110 3.08 3.215 19.44
## Hornet Sportabout              18.7        8 360.0         175 3.15 3.440 17.02
## Valiant                        18.1        6 225.0         105 2.76 3.460 20.22
## Duster 360                     14.3        8 360.0         245 3.21 3.570 15.84
## Merc 240D                      24.4        4 146.7          62 3.69 3.190 20.00
## Merc 230                       22.8        4 140.8          95 3.92 3.150 22.90
## Merc 280                       19.2        6 167.6         123 3.92 3.440 18.30
## Merc 280C                      17.8        6 167.6         123 3.92 3.440 18.90
## Merc 450SE                     16.4        8 275.8         180 3.07 4.070 17.40
## Merc 450SL                     17.3        8 275.8         180 3.07 3.730 17.60
## Merc 450SLC                    15.2        8 275.8         180 3.07 3.780 18.00
## Cadillac Fleetwood             10.4        8 472.0         205 2.93 5.250 17.98
## Lincoln Continental            10.4        8 460.0         215 3.00 5.424 17.82
## Chrysler Imperial              14.7        8 440.0         230 3.23 5.345 17.42
## Fiat 128                       32.4        4  78.7          66 4.08 2.200 19.47
## Honda Civic                    30.4        4  75.7          52 4.93 1.615 18.52
## Toyota Corolla                 33.9        4  71.1          65 4.22 1.835 19.90
## Toyota Corona                  21.5        4 120.1          97 3.70 2.465 20.01
## Dodge Challenger               15.5        8 318.0         150 2.76 3.520 16.87
## AMC Javelin                    15.2        8 304.0         150 3.15 3.435 17.30
## Camaro Z28                     13.3        8 350.0         245 3.73 3.840 15.41
## Pontiac Firebird               19.2        8 400.0         175 3.08 3.845 17.05
## Fiat X1-9                      27.3        4  79.0          66 4.08 1.935 18.90
## Porsche 914-2                  26.0        4 120.3          91 4.43 2.140 16.70
## Lotus Europa                   30.4        4  95.1         113 3.77 1.513 16.90
## Ford Pantera L                 15.8        8 351.0         264 4.22 3.170 14.50
## Ferrari Dino                   19.7        6 145.0         175 3.62 2.770 15.50
## Maserati Bora                  15.0        8 301.0         335 3.54 3.570 14.60
## Volvo 142E                     21.4        4 121.0         109 4.11 2.780 18.60
##                     vs am gear carb
## Mazda RX4            0  1    4    4
## Mazda RX4 Wag        0  1    4    4
## Datsun 710           1  1    4    1
## Hornet 4 Drive       1  0    3    1
## Hornet Sportabout    0  0    3    2
## Valiant              1  0    3    1
## Duster 360           0  0    3    4
## Merc 240D            1  0    4    2
## Merc 230             1  0    4    2
## Merc 280             1  0    4    4
## Merc 280C            1  0    4    4
## Merc 450SE           0  0    3    3
## Merc 450SL           0  0    3    3
## Merc 450SLC          0  0    3    3
## Cadillac Fleetwood   0  0    3    4
## Lincoln Continental  0  0    3    4
## Chrysler Imperial    0  0    3    4
## Fiat 128             1  1    4    1
## Honda Civic          1  1    4    2
## Toyota Corolla       1  1    4    1
## Toyota Corona        1  0    3    1
## Dodge Challenger     0  0    3    2
## AMC Javelin          0  0    3    2
## Camaro Z28           0  0    3    4
## Pontiac Firebird     0  0    3    2
## Fiat X1-9            1  1    4    1
## Porsche 914-2        0  1    5    2
## Lotus Europa         1  1    5    2
## Ford Pantera L       0  1    5    4
## Ferrari Dino         0  1    5    6
## Maserati Bora        0  1    5    8
## Volvo 142E           1  1    4    2

# Alternatively
D = mtcars
names(D)[c(1, 2, 4)] = c("mile per gallon", "cylinder", "horse power")
D

##                     mile per gallon cylinder  disp horse power drat    wt  qsec
## Mazda RX4                      21.0        6 160.0         110 3.90 2.620 16.46
## Mazda RX4 Wag                  21.0        6 160.0         110 3.90 2.875 17.02
## Datsun 710                     22.8        4 108.0          93 3.85 2.320 18.61
## Hornet 4 Drive                 21.4        6 258.0         110 3.08 3.215 19.44
## Hornet Sportabout              18.7        8 360.0         175 3.15 3.440 17.02
## Valiant                        18.1        6 225.0         105 2.76 3.460 20.22
## Duster 360                     14.3        8 360.0         245 3.21 3.570 15.84
## Merc 240D                      24.4        4 146.7          62 3.69 3.190 20.00
## Merc 230                       22.8        4 140.8          95 3.92 3.150 22.90
## Merc 280                       19.2        6 167.6         123 3.92 3.440 18.30
## Merc 280C                      17.8        6 167.6         123 3.92 3.440 18.90
## Merc 450SE                     16.4        8 275.8         180 3.07 4.070 17.40
## Merc 450SL                     17.3        8 275.8         180 3.07 3.730 17.60
## Merc 450SLC                    15.2        8 275.8         180 3.07 3.780 18.00
## Cadillac Fleetwood             10.4        8 472.0         205 2.93 5.250 17.98
## Lincoln Continental            10.4        8 460.0         215 3.00 5.424 17.82
## Chrysler Imperial              14.7        8 440.0         230 3.23 5.345 17.42
## Fiat 128                       32.4        4  78.7          66 4.08 2.200 19.47
## Honda Civic                    30.4        4  75.7          52 4.93 1.615 18.52
## Toyota Corolla                 33.9        4  71.1          65 4.22 1.835 19.90
## Toyota Corona                  21.5        4 120.1          97 3.70 2.465 20.01
## Dodge Challenger               15.5        8 318.0         150 2.76 3.520 16.87
## AMC Javelin                    15.2        8 304.0         150 3.15 3.435 17.30
## Camaro Z28                     13.3        8 350.0         245 3.73 3.840 15.41
## Pontiac Firebird               19.2        8 400.0         175 3.08 3.845 17.05
## Fiat X1-9                      27.3        4  79.0          66 4.08 1.935 18.90
## Porsche 914-2                  26.0        4 120.3          91 4.43 2.140 16.70
## Lotus Europa                   30.4        4  95.1         113 3.77 1.513 16.90
## Ford Pantera L                 15.8        8 351.0         264 4.22 3.170 14.50
## Ferrari Dino                   19.7        6 145.0         175 3.62 2.770 15.50
## Maserati Bora                  15.0        8 301.0         335 3.54 3.570 14.60
## Volvo 142E                     21.4        4 121.0         109 4.11 2.780 18.60
##                     vs am gear carb
## Mazda RX4            0  1    4    4
## Mazda RX4 Wag        0  1    4    4
## Datsun 710           1  1    4    1
## Hornet 4 Drive       1  0    3    1
## Hornet Sportabout    0  0    3    2
## Valiant              1  0    3    1
## Duster 360           0  0    3    4
## Merc 240D            1  0    4    2
## Merc 230             1  0    4    2
## Merc 280             1  0    4    4
## Merc 280C            1  0    4    4
## Merc 450SE           0  0    3    3
## Merc 450SL           0  0    3    3
## Merc 450SLC          0  0    3    3
## Cadillac Fleetwood   0  0    3    4
## Lincoln Continental  0  0    3    4
## Chrysler Imperial    0  0    3    4
## Fiat 128             1  1    4    1
## Honda Civic          1  1    4    2
## Toyota Corolla       1  1    4    1
## Toyota Corona        1  0    3    1
## Dodge Challenger     0  0    3    2
## AMC Javelin          0  0    3    2
## Camaro Z28           0  0    3    4
## Pontiac Firebird     0  0    3    2
## Fiat X1-9            1  1    4    1
## Porsche 914-2        0  1    5    2
## Lotus Europa         1  1    5    2
## Ford Pantera L       0  1    5    4
## Ferrari Dino         0  1    5    6
## Maserati Bora        0  1    5    8
## Volvo 142E           1  1    4    2

# Subsetting a data frame by selecting some columns
library(dplyr)
select(mtcars, disp, wt, hp)

##                      disp    wt  hp
## Mazda RX4           160.0 2.620 110
## Mazda RX4 Wag       160.0 2.875 110
## Datsun 710          108.0 2.320  93
## Hornet 4 Drive      258.0 3.215 110
## Hornet Sportabout   360.0 3.440 175
## Valiant             225.0 3.460 105
## Duster 360          360.0 3.570 245
## Merc 240D           146.7 3.190  62
## Merc 230            140.8 3.150  95
## Merc 280            167.6 3.440 123
## Merc 280C           167.6 3.440 123
## Merc 450SE          275.8 4.070 180
## Merc 450SL          275.8 3.730 180
## Merc 450SLC         275.8 3.780 180
## Cadillac Fleetwood  472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial   440.0 5.345 230
## Fiat 128             78.7 2.200  66
## Honda Civic          75.7 1.615  52
## Toyota Corolla       71.1 1.835  65
## Toyota Corona       120.1 2.465  97
## Dodge Challenger    318.0 3.520 150
## AMC Javelin         304.0 3.435 150
## Camaro Z28          350.0 3.840 245
## Pontiac Firebird    400.0 3.845 175
## Fiat X1-9            79.0 1.935  66
## Porsche 914-2       120.3 2.140  91
## Lotus Europa         95.1 1.513 113
## Ford Pantera L      351.0 3.170 264
## Ferrari Dino        145.0 2.770 175
## Maserati Bora       301.0 3.570 335
## Volvo 142E          121.0 2.780 109

# Alternatively
mtcars[ , c(3, 6, 4)]

##                      disp    wt  hp
## Mazda RX4           160.0 2.620 110
## Mazda RX4 Wag       160.0 2.875 110
## Datsun 710          108.0 2.320  93
## Hornet 4 Drive      258.0 3.215 110
## Hornet Sportabout   360.0 3.440 175
## Valiant             225.0 3.460 105
## Duster 360          360.0 3.570 245
## Merc 240D           146.7 3.190  62
## Merc 230            140.8 3.150  95
## Merc 280            167.6 3.440 123
## Merc 280C           167.6 3.440 123
## Merc 450SE          275.8 4.070 180
## Merc 450SL          275.8 3.730 180
## Merc 450SLC         275.8 3.780 180
## Cadillac Fleetwood  472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial   440.0 5.345 230
## Fiat 128             78.7 2.200  66
## Honda Civic          75.7 1.615  52
## Toyota Corolla       71.1 1.835  65
## Toyota Corona       120.1 2.465  97
## Dodge Challenger    318.0 3.520 150
## AMC Javelin         304.0 3.435 150
## Camaro Z28          350.0 3.840 245
## Pontiac Firebird    400.0 3.845 175
## Fiat X1-9            79.0 1.935  66
## Porsche 914-2       120.3 2.140  91
## Lotus Europa         95.1 1.513 113
## Ford Pantera L      351.0 3.170 264
## Ferrari Dino        145.0 2.770 175
## Maserati Bora       301.0 3.570 335
## Volvo 142E          121.0 2.780 109

# or 
mtcars[c(3, 6, 4)]

##                      disp    wt  hp
## Mazda RX4           160.0 2.620 110
## Mazda RX4 Wag       160.0 2.875 110
## Datsun 710          108.0 2.320  93
## Hornet 4 Drive      258.0 3.215 110
## Hornet Sportabout   360.0 3.440 175
## Valiant             225.0 3.460 105
## Duster 360          360.0 3.570 245
## Merc 240D           146.7 3.190  62
## Merc 230            140.8 3.150  95
## Merc 280            167.6 3.440 123
## Merc 280C           167.6 3.440 123
## Merc 450SE          275.8 4.070 180
## Merc 450SL          275.8 3.730 180
## Merc 450SLC         275.8 3.780 180
## Cadillac Fleetwood  472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial   440.0 5.345 230
## Fiat 128             78.7 2.200  66
## Honda Civic          75.7 1.615  52
## Toyota Corolla       71.1 1.835  65
## Toyota Corona       120.1 2.465  97
## Dodge Challenger    318.0 3.520 150
## AMC Javelin         304.0 3.435 150
## Camaro Z28          350.0 3.840 245
## Pontiac Firebird    400.0 3.845 175
## Fiat X1-9            79.0 1.935  66
## Porsche 914-2       120.3 2.140  91
## Lotus Europa         95.1 1.513 113
## Ford Pantera L      351.0 3.170 264
## Ferrari Dino        145.0 2.770 175
## Maserati Bora       301.0 3.570 335
## Volvo 142E          121.0 2.780 109

# or 
mtcars[, c("disp", "wt", "hp")]

##                      disp    wt  hp
## Mazda RX4           160.0 2.620 110
## Mazda RX4 Wag       160.0 2.875 110
## Datsun 710          108.0 2.320  93
## Hornet 4 Drive      258.0 3.215 110
## Hornet Sportabout   360.0 3.440 175
## Valiant             225.0 3.460 105
## Duster 360          360.0 3.570 245
## Merc 240D           146.7 3.190  62
## Merc 230            140.8 3.150  95
## Merc 280            167.6 3.440 123
## Merc 280C           167.6 3.440 123
## Merc 450SE          275.8 4.070 180
## Merc 450SL          275.8 3.730 180
## Merc 450SLC         275.8 3.780 180
## Cadillac Fleetwood  472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial   440.0 5.345 230
## Fiat 128             78.7 2.200  66
## Honda Civic          75.7 1.615  52
## Toyota Corolla       71.1 1.835  65
## Toyota Corona       120.1 2.465  97
## Dodge Challenger    318.0 3.520 150
## AMC Javelin         304.0 3.435 150
## Camaro Z28          350.0 3.840 245
## Pontiac Firebird    400.0 3.845 175
## Fiat X1-9            79.0 1.935  66
## Porsche 914-2       120.3 2.140  91
## Lotus Europa         95.1 1.513 113
## Ford Pantera L      351.0 3.170 264
## Ferrari Dino        145.0 2.770 175
## Maserati Bora       301.0 3.570 335
## Volvo 142E          121.0 2.780 109

# or
mtcars[c("disp", "wt", "hp")]

##                      disp    wt  hp
## Mazda RX4           160.0 2.620 110
## Mazda RX4 Wag       160.0 2.875 110
## Datsun 710          108.0 2.320  93
## Hornet 4 Drive      258.0 3.215 110
## Hornet Sportabout   360.0 3.440 175
## Valiant             225.0 3.460 105
## Duster 360          360.0 3.570 245
## Merc 240D           146.7 3.190  62
## Merc 230            140.8 3.150  95
## Merc 280            167.6 3.440 123
## Merc 280C           167.6 3.440 123
## Merc 450SE          275.8 4.070 180
## Merc 450SL          275.8 3.730 180
## Merc 450SLC         275.8 3.780 180
## Cadillac Fleetwood  472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial   440.0 5.345 230
## Fiat 128             78.7 2.200  66
## Honda Civic          75.7 1.615  52
## Toyota Corolla       71.1 1.835  65
## Toyota Corona       120.1 2.465  97
## Dodge Challenger    318.0 3.520 150
## AMC Javelin         304.0 3.435 150
## Camaro Z28          350.0 3.840 245
## Pontiac Firebird    400.0 3.845 175
## Fiat X1-9            79.0 1.935  66
## Porsche 914-2       120.3 2.140  91
## Lotus Europa         95.1 1.513 113
## Ford Pantera L      351.0 3.170 264
## Ferrari Dino        145.0 2.770 175
## Maserati Bora       301.0 3.570 335
## Volvo 142E          121.0 2.780 109

# We can also deselect some columns to form a subset
select(mtcars, -am, -carb)

##                      mpg cyl  disp  hp drat    wt  qsec vs gear
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1    4
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1    3
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0    3
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1    3
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0    3
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1    4
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1    4
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0    3
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0    3
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0    3
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1    4
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1    4
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1    4
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1    3
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0    3
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0    3
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0    3
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0    3
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1    4
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0    5
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1    5
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0    5
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0    5
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0    5
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1    4

# Subsetting a data frame by selecting some rows meeting some conditions
D = subset(mtcars, cyl == 6 & gear %in% c(3, 5) & hp > 100)
D

##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## Ferrari Dino   19.7   6  145 175 3.62 2.770 15.50  0  1    5    6

# or
D = mtcars[mtcars$cyl == 6 & mtcars$gear %in% c(3, 5) & mtcars$hp > 100, ]
D

##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## Ferrari Dino   19.7   6  145 175 3.62 2.770 15.50  0  1    5    6
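
The same row filter can also be written with dplyr, which this document already loads. The following is a minimal sketch, assuming dplyr is installed; the object names `D_filtered` and `D_base` are chosen here only for illustration. In `filter()`, conditions separated by commas are combined with a logical AND.

```r
library(dplyr)

# Keep 6-cylinder cars with 3 or 5 gears and more than 100 horsepower.
# Comma-separated conditions in filter() are combined with AND.
D_filtered <- filter(mtcars, cyl == 6, gear %in% c(3, 5), hp > 100)

# Base R equivalent from the text, for comparison
D_base <- mtcars[mtcars$cyl == 6 & mtcars$gear %in% c(3, 5) & mtcars$hp > 100, ]

nrow(D_filtered)                     # number of matching rows
all(D_filtered$mpg == D_base$mpg)    # TRUE if both approaches select the same rows
```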

str(iris) # The data frame "iris" comes from the "datasets" package, which R loads by default

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

plot(iris)

head(mpg) # Display the first 6 rows of the data frame "mpg" (from the "ggplot2" package)

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

table(mpg$class) # Tabulate the class column in the data frame "mpg": a frequency table

## 
##    2seater    compact    midsize    minivan     pickup subcompact        suv 
##          5         47         41         11         33         35         62

# Create a random sample from a finite population with known elements
population = c(12, 56, 87, 43, 56, 54, 82, 34, 61, 52, 84, 97, 37, 28, 39)
y = sample(x = population, size = 5, replace = FALSE) # Sampling 5 values from the population without replacement
y

## [1] 43 54 39 52 61

# Create a sample from a discrete population with a known distribution. 
z = sample(1:3, size = 1000, prob = c(0.7, 0.15, 0.15), replace = TRUE) # Sampling must be with replacement: 1000 draws are taken from only 3 values
table(z)/1000*100 # Check the quality of the sample to see how close it is to the population

## z
##    1    2    3 
## 68.2 16.7 15.1
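
Because `sample()` is random, the values shown above will differ from run to run. A common way to make a draw reproducible is to fix the seed first. The sketch below reuses the population from the text; the seed value 415 and the names `first_draw` and `second_draw` are arbitrary choices made for this illustration.

```r
# Fix the random seed so that the "random" draw can be reproduced.
# The seed value 415 is an arbitrary choice for illustration.
population <- c(12, 56, 87, 43, 56, 54, 82, 34, 61, 52, 84, 97, 37, 28, 39)

set.seed(415)
first_draw <- sample(x = population, size = 5, replace = FALSE)

set.seed(415)                        # resetting the seed replays the same draw
second_draw <- sample(x = population, size = 5, replace = FALSE)

identical(first_draw, second_draw)   # TRUE
```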

# A question: How can we write a function that randomly chooses a given number of rows 
# from an existing data frame to create a new data frame? This is called data partitioning.
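
One possible answer to the question above, as a minimal sketch rather than a definitive solution: shuffle the row indices with `sample()` and keep the first few. The helper name `sample_rows` and its parameter `k` are hypothetical.

```r
# sample_rows() is a hypothetical helper: it draws k distinct rows at random from D
sample_rows <- function(D, k) {
  stopifnot(k >= 0, k <= nrow(D))          # cannot draw more rows than exist
  idx <- sample(seq_len(nrow(D)), size = k, replace = FALSE)
  D[idx, , drop = FALSE]                   # drop = FALSE keeps a data frame even with one column
}

picked <- sample_rows(mtcars, 10)
dim(picked)   # 10 rows, 11 columns
```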

# Group your data by one or more categorical variables and then summarize the grouped data
library(dplyr) # Load the package
mySummary = mtcars %>% group_by(vs, am) %>% summarise(n = n())

## `summarise()` regrouping output by 'vs' (override with `.groups` argument)

mySummary

## # A tibble: 4 x 3
## # Groups:   vs [2]
##      vs    am     n
##   <dbl> <dbl> <int>
## 1     0     0    12
## 2     0     1     6
## 3     1     0     7
## 4     1     1     7
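
A grouped summary is not limited to counts. The sketch below, which assumes dplyr 1.0 or later (where the `.groups` argument mentioned in the message above exists), adds a group mean and returns an ungrouped table. The name `group_stats` is chosen only for this illustration.

```r
library(dplyr)

# Count the cars and average their fuel economy within each (vs, am) group.
# .groups = "drop" returns an ungrouped result and silences the regrouping message.
group_stats <- mtcars %>%
  group_by(vs, am) %>%
  summarise(n = n(), mean_mpg = mean(mpg), .groups = "drop")

group_stats
```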

# Remove all the objects we created so far. This can be very useful!
rm(list = ls()) 

# Round values
x = c(100.45, 67.35, 78.82, 98.43, - 67.41, -84.92)
round(x, 1) # round to one decimal place

## [1] 100.4  67.3  78.8  98.4 -67.4 -84.9

round(x, 0) # round to the nearest whole number

## [1] 100  67  79  98 -67 -85

round(x, -1) # round to the nearest ten

## [1] 100  70  80 100 -70 -80
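
Two details of rounding in R are worth a short illustration. First, `round()` sends exact halves to the even digit rather than always up. Second, decimal fractions such as 100.45 usually have no exact binary representation, which is the likely reason `round(100.45, 1)` gives 100.4 in the output above. The sketch also shows the related functions `floor()`, `ceiling()`, `trunc()`, and `signif()`.

```r
# round() follows the IEC 60559 convention: exact halves go to the even digit
round(0.5)    # 0
round(1.5)    # 2
round(2.5)    # 2

# Related functions that always move in a fixed direction
floor(-67.41)    # -68: largest integer not above the value
ceiling(-67.41)  # -67: smallest integer not below the value
trunc(-67.41)    # -67: drops the fractional part (rounds toward zero)

# signif() keeps significant digits rather than decimal places
signif(98.43, 2) # 98
```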

# Paste a few strings together; sep = "" means that no separator is inserted between them.
paste("Tomorrow is ", Sys.Date() + 1, ", the due date for ", "project #", 15, ". ", "Don't miss it!", sep = "")

## [1] "Tomorrow is 2021-02-02, the due date for project #15. Don't miss it!"
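
The `sep` argument is easy to confuse with `collapse`. As a short sketch: `sep` goes between the arguments of one call, `collapse` joins the elements of a resulting vector into a single string, and `paste0()` is a shortcut for `sep = ""`.

```r
paste0("project #", 15)                       # "project #15": paste0() is paste() with sep = ""
paste("project", 15, sep = "-")               # "project-15": sep goes between the arguments
paste(c("a", "b", "c"), collapse = ", ")      # "a, b, c": collapse joins the elements of one vector
paste("item", 1:3, sep = "_", collapse = "+") # "item_1+item_2+item_3"
```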

# The switch() function for conditional execution
x = c(45, 78, 93, 25, 54, 80)
stats = "Sd" # Matching is case-sensitive: "Sd" does not match "SD", so the default (last, unnamed) branch runs
switch(stats,
       Mean = mean(x),
       SD = sd(x),
       Median = median(x),
       Summary = summary(x),
       cat("Sorry, it goes beyond my capacity.")
)

## Sorry, it goes beyond my capacity.
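
The default branch ran above because `switch()` matches names exactly, including case. One way to make the lookup case-insensitive is to normalize the input before matching. This is a sketch under that assumption; the wrapper name `describe` is hypothetical, and it stops with an error instead of printing a message when nothing matches.

```r
# describe() is a hypothetical wrapper around switch()
describe <- function(x, stat) {
  switch(toupper(stat),              # "Sd", "sd", and "SD" all become "SD"
         MEAN = mean(x),
         SD = sd(x),
         MEDIAN = median(x),
         SUMMARY = summary(x),
         stop("Unknown statistic: ", stat)  # default branch: fail loudly
  )
}

x <- c(45, 78, 93, 25, 54, 80)
describe(x, "Sd")      # now matches the SD branch
describe(x, "median")  # 66
```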

# Handling dates
y = c(34, 56, 61, 78, 84, 92, 100, 120, 125)
x = c("1990-05-01", "1990-05-02", "1990-05-03", "1990-05-04", "1990-05-05", "1990-05-06", "1990-05-07", "1990-05-08", "1990-05-09")
dx = as.Date(x)
dx

## [1] "1990-05-01" "1990-05-02" "1990-05-03" "1990-05-04" "1990-05-05"
## [6] "1990-05-06" "1990-05-07" "1990-05-08" "1990-05-09"

class(x)

## [1] "character"

class(dx)

## [1] "Date"

plot(y~dx, xlab = "Date")

D = Sys.Date() # Get today's date
weekdays(D) # Get the day of the week

## [1] "Monday"

months(D)

## [1] "February"

quarters(D)

## [1] "Q1"

julian(D) # Number of days since the origin (1970-01-01)

## [1] 18659
## attr(,"origin")
## [1] "1970-01-01"

julian(D, origin = as.Date("2000-07-01")) # Number of days since the origin (2000-07-01)

## [1] 7520
## attr(,"origin")
## [1] "2000-07-01"
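
Once strings are converted to the "Date" class, ordinary arithmetic works on them. The sketch below shows date differences, sequences, and custom formats using only base R; the variable names are chosen for illustration.

```r
d1 <- as.Date("1990-05-01")
d2 <- as.Date("1990-05-09")

d2 - d1                         # Time difference of 8 days
as.numeric(d2 - d1)             # 8, as a plain number

# A sequence of 9 consecutive days starting at d1
dx <- seq(d1, by = "day", length.out = 9)

format(d1, "%d/%m/%Y")          # "01/05/1990": day/month/year
as.Date("05/09/1990", format = "%m/%d/%Y")  # parse a non-ISO string: 1990-05-09
```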

User-defined Functions in R

R users can also define their own functions. The structure of a user-defined function in R looks like the following:

  # functionName = function(a list of parameters/arguments separated by comma){
  #    The function body with the last line being the returned value (can be any data structure)
  # }

The different parts of a function are

Function Name- This is the actual name of the function. It is stored in the R environment as an object with this name.
Arguments- An argument is a placeholder. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may contain no arguments. Arguments can also have default values.
Function Body- The function body contains a collection of statements that defines what the function does.
Return Value- The return value of a function is the last expression in the function body to be evaluated.

(From https://www.tutorialspoint.com/r/r_functions.htm)
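
The parts listed above can be seen in a small sketch. The functions `power_sum` and `say_hello` are hypothetical examples written for this illustration: the first has an argument with a default value, and the second has no arguments at all.

```r
# power_sum() is a hypothetical example: it sums x^p, with p defaulting to 2
power_sum <- function(x, p = 2) {
  sum(x^p)          # last evaluated expression, so this is the return value
}

power_sum(1:3)          # 14: uses the default p = 2
power_sum(1:3, p = 3)   # 36: overrides the default
power_sum(p = 1, x = 1:3)  # 6: named arguments may be given in any order

# A function with no arguments at all
say_hello <- function() "hello"
say_hello()             # "hello"
```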

f1 = function(x){
  2*x^2 - 3/x +1
}

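# A safer version that handles a zero input (x is assumed to be a single number)
f2 = function(x){
  if (x != 0) {
    2*x^2 - 3/x +1
  } else {
    # sep = "" avoids the extra space cat() would otherwise insert after the line break
    cat("Can't be done due to zero denominator!\n", "Please use a non-zero input.\n", sep = "")
  }
}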
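
The `if` statement in `f2` expects a single value; in recent versions of R, passing a vector of length greater than one to `if` is an error. A vectorized alternative is sketched below with `ifelse()`. The name `f3` is hypothetical, and returning `NA` for a zero input is a design choice made here, not something taken from the text.

```r
# f3() is a hypothetical vectorized variant of f2(): zero inputs yield NA
f3 <- function(x) {
  ifelse(x == 0, NA_real_, 2*x^2 - 3/x + 1)
}

f3(c(1, 0, 2))   # 0.0  NA  7.5
```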

# A function for a simple summary of a numeric sample
mySummary = function(x){
  Mean = mean(x)
  Median = median(x)
  Std = sd(x)
  
  list(Mean = Mean, Median = Median, "Standard Deviation" = Std)
  
}
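
Since `mySummary` returns a list, its elements are retrieved by name. The sketch below repeats a compact form of the definition so that it runs on its own, and uses a small made-up sample.

```r
# Same idea as the definition in the text, repeated so the example runs on its own
mySummary <- function(x){
  list(Mean = mean(x), Median = median(x), "Standard Deviation" = sd(x))
}

s <- mySummary(c(2, 4, 6, 8))
s$Mean                      # 5
s[["Median"]]               # 5
s[["Standard Deviation"]]   # names containing spaces need [[ ]] or backticks
names(s)
```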

# A function that returns the elements of a vector in reverse order
rprint = function(x){
  n = length(x)
  v = NULL
  for (i in seq_len(n)){ # seq_len(n) is empty when n = 0, whereas 1:n would give c(1, 0)
    v[i] = x[n-i+1]
  }
  v
}

rprint(1:9)

## [1] 9 8 7 6 5 4 3 2 1

rprint(letters)

##  [1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h"
## [20] "g" "f" "e" "d" "c" "b" "a"

# There is a built-in function in R that gives the reversal
rev(1:9)

## [1] 9 8 7 6 5 4 3 2 1

rev(letters)

##  [1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h"
## [20] "g" "f" "e" "d" "c" "b" "a"

# A user-defined function that partitions a data frame into training data, validation data, and test data
partition = function(D, prop = c(0.6, 0.2, 0.2)){ # D is a data frame to partition
  n = nrow(D)
  idx = sample(x = 1:n, size = n, replace = FALSE) # Shuffle the original rows of D
  shuffled = D[idx, ]
  n1 = round(n * prop[1])
  n2 = round(n * prop[2])
  n3 = n - n1 - n2 # Can I do n3 = round(n * prop[3])?
  training = shuffled[1:n1, ]
  validation = shuffled[(n1+1):(n1+n2), ]
  test = shuffled[(n1 + n2 + 1):n, ]
  
  L = list(training = training, validation = validation, test = test)
  return(L)
}

partition(mtcars, c(0.7, 0.15, 0.15))

## $training
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 
## $validation
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Merc 450SE     16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## AMC Javelin    15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 
## $test
##                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Datsun 710       22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Mazda RX4 Wag    21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Fiat 128         32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Merc 240D        24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
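
The comment inside `partition` asks whether `n3 = round(n * prop[3])` would also work. A small arithmetic sketch suggests why the remainder is the safer choice: rounding each part separately does not always add up to `n`. The values of `n` and `prop` below are made up for illustration.

```r
n <- 10
prop <- c(1, 1, 1) / 3

sizes_rounded <- round(n * prop)   # 3 3 3
sum(sizes_rounded)                 # 9: one row would be left out

n1 <- round(n * prop[1])
n2 <- round(n * prop[2])
n3 <- n - n1 - n2                  # 4: the remainder absorbs the rounding error
n1 + n2 + n3                       # 10
```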

The Pipe Operator “%>%” in R

The pipe operator, %>%, comes from the “magrittr” package. The point of the pipe is to help you write human-friendly code.

In mathematics, you can make a composite function by doing something like y = f(g(h(x))), which is equivalent to

\[x \rightarrow h() \rightarrow g() \rightarrow f() = y\]

The above process involves 4 steps:

Step 1: Input “x” into the function “h”.

Step 2: Input the result “h(x)” from step 1 into the function “g”.

Step 3: Input the result “g(h(x))” from step 2 into the function “f”.

Step 4: The output is “f(g(h(x)))” and assigned to “y”.

In the “magrittr” package, the right arrow \(\rightarrow\) is represented by “%>%”. Here are some examples.

x = c( 23, 45, 34, 78, 12, 56)
mean(x)

## [1] 41.33333

# The following 3 lines of code are each equivalent to the previous line
x %>% mean() # that is, we can factor x out!

## [1] 41.33333

x %>% mean

## [1] 41.33333

x %>% mean(.) # "." is a placeholder

## [1] 41.33333

x %>% sqrt() %>% sum() # A chain of two pipes: just for fun

## [1] 37.11416
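
The four-step composition described at the start of this section can be checked directly. In the sketch below, `h`, `g`, and `f` are hypothetical functions chosen only for illustration, and the magrittr package is assumed to be installed. The nested call and the piped call give the same value.

```r
library(magrittr)

# h, g, and f are hypothetical functions chosen only for illustration
h <- function(x) x + 1
g <- function(x) x * 2
f <- function(x) x^2

x <- 3
y_nested <- f(g(h(x)))             # read inside-out: h first, then g, then f
y_piped  <- x %>% h() %>% g() %>% f()  # read left to right, in order of execution

y_nested   # 64
y_piped    # 64
```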

# The following gives a more realistic example.
library(dplyr) # The count() function is from this package.
print(starwars) # A dataset from the package

## # A tibble: 87 x 14
##    name  height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4     97    32 <NA>       white, red red             NA   none  mascu…
##  9 Bigg…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

# The column "species" is a categorical variable in the data "starwars".
# The following gets its distribution, which is discrete.
D1 <- starwars %>% count(species) 
D1

## # A tibble: 38 x 2
##    species       n
##    <chr>     <int>
##  1 Aleena        1
##  2 Besalisk      1
##  3 Cerean        1
##  4 Chagrian      1
##  5 Clawdite      1
##  6 Droid         6
##  7 Dug           1
##  8 Ewok          1
##  9 Geonosian     1
## 10 Gungan        3
## # … with 28 more rows

# Plot the discrete distribution
barplot(height = D1$n, names = D1$species)
with(D1, barplot(height = n, names = species)) # Alternatively

# Sort the distribution table by the frequency (column "n")
D2 <- starwars %>% count(species, sort = TRUE)
D2

## # A tibble: 38 x 2
##    species      n
##    <chr>    <int>
##  1 Human       35
##  2 Droid        6
##  3 <NA>         4
##  4 Gungan       3
##  5 Kaminoan     2
##  6 Mirialan     2
##  7 Twi'lek      2
##  8 Wookiee      2
##  9 Zabrak       2
## 10 Aleena       1
## # … with 28 more rows

# Plot the discrete distribution
bp = barplot(height = D2$n, names = D2$species, ylim = c(0, max(D2$n)*1.1), las = 2) # A sorted bar chart (a Pareto chart without the cumulative line)

with(D2, barplot(height = n, names = species)) # Alternatively

# The following adds the frequency labels: each is placed above (pos = 3) the point at 90% of its bar's height, offset by 0.1 character widths
text(bp, D2$n*0.9, labels = D2$n, pos = 3, offset = 0.1, col = "red")
title("Distribution of Species in Starwars", col.main = "blue", cex.main = 2, sub = "(data courtesy of xyz)")

# Another way to plot: just for illustration and not recommended
starwars %>% .$species %>% table() %>% sort(., decreasing = TRUE) %>% barplot(ylim = c(0, max(D2$n)*1.1), las = 2) %>% text(., D2$n, labels = D2$n, pos = 3, offset = 0.1, col = "red")

# Joint distribution of "sex" and "gender" and sort by the frequency (column "n")
D3 <- starwars %>% count(sex, gender, sort = TRUE)
D3

## # A tibble: 6 x 3
##   sex            gender        n
##   <chr>          <chr>     <int>
## 1 male           masculine    60
## 2 female         feminine     16
## 3 none           masculine     5
## 4 <NA>           <NA>          4
## 5 hermaphroditic masculine     1
## 6 none           feminine      1
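
`count()` can be read as a shortcut for grouping and then summarising. The sketch below compares the two forms on `toy`, a small made-up data frame used only for illustration; dplyr 1.0 or later is assumed.

```r
library(dplyr)

# toy is a small made-up data frame used only for illustration
toy <- data.frame(kind = c("a", "b", "a", "c", "a", "b"))

by_count <- toy %>% count(kind, sort = TRUE)

by_summarise <- toy %>%
  group_by(kind) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n))

by_count
by_summarise
```

Both tables list the categories from most to least frequent with the same counts.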

Missing Values

Your data may have missing values. In R, missing values are indicated by “NA”. Missing values can be removed or imputed, depending on the context. A great deal of research has been done on handling missing values.

x = c(2, 6, 9, NA, 10, 23, NA, 30)

print(x)

## [1]  2  6  9 NA 10 23 NA 30

mean(x) # Produces NA

## [1] NA

sd(x) # Produces NA

## [1] NA

summary(x) # NAs are excluded from the statistics and counted separately

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00    6.75    9.50   13.33   19.75   30.00       2

# Handling missing values by simply removing them with the "na.rm" option.
mean(x, na.rm = TRUE)

## [1] 13.33333

sd(x, na.rm = TRUE)

## [1] 10.80123

# Remove the missing values to create a new vector
y = as.numeric(na.omit(x))
print(y)

## [1]  2  6  9 10 23 30

mean(y)

## [1] 13.33333

D1 = data.frame(x = c(1:10, NA, 12:15), y = c(2, 4, 5, 1, 0, NA, 7.2, NA, 10, 13.4, 15.2, NA, 18.5, 11, 20.5))
D2 = na.omit(D1) # Remove rows with a missing value

D1

##     x    y
## 1   1  2.0
## 2   2  4.0
## 3   3  5.0
## 4   4  1.0
## 5   5  0.0
## 6   6   NA
## 7   7  7.2
## 8   8   NA
## 9   9 10.0
## 10 10 13.4
## 11 NA 15.2
## 12 12   NA
## 13 13 18.5
## 14 14 11.0
## 15 15 20.5

D2

##     x    y
## 1   1  2.0
## 2   2  4.0
## 3   3  5.0
## 4   4  1.0
## 5   5  0.0
## 7   7  7.2
## 9   9 10.0
## 10 10 13.4
## 13 13 18.5
## 14 14 11.0
## 15 15 20.5
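
The examples above remove missing values. Locating and imputing them can be sketched as follows. Mean imputation is used here only because it is simple; it is an assumption of this sketch, not a recommendation, and the right strategy depends on the data. The name `x_imputed` is chosen for illustration.

```r
x <- c(2, 6, 9, NA, 10, 23, NA, 30)

is.na(x)          # TRUE where a value is missing
sum(is.na(x))     # 2 missing values
which(is.na(x))   # positions 4 and 7

# Mean imputation: replace each NA with the mean of the observed values.
# This is only one simple strategy; whether it is appropriate depends on the data.
x_imputed <- x
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)
x_imputed

# For data frames, complete.cases() flags the rows that have no NA
D1 <- data.frame(x = c(1, NA, 3), y = c(NA, 5, 6))
D1[complete.cases(D1), ]   # keeps only the third row
```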

Reading External Data into R

In R, we can read data from files stored outside the R environment. We can also write data to files, which are then stored and accessed through the operating system. R can read from and write to various file formats, such as CSV and Excel.

If the data are on the local computer, it is convenient if they are stored in the same working directory where most of your R files are stored.

You can check which directory the R workspace is pointing to using the “getwd” function. You can also set a new working directory using the “setwd” function.

getwd() # Get the working directory

## [1] "/Users/home/Documents/Zhang/Stat415.515.615"

#setwd("/Users/home/Documents/Zhang/Stat415.515.615") # Set it to a new one
#getwd()
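
Writing data to a file, which this section mentions, can be sketched as a round trip: write a data frame to a CSV file and read it back. A temporary file is used here so that the example does not depend on any particular working directory; the names `path` and `roundtrip` are chosen for illustration.

```r
# Write a built-in data frame to a temporary CSV file, then read it back.
# tempfile() is used so that no file is left in the working directory.
path <- tempfile(fileext = ".csv")
write.csv(head(mtcars), file = path, row.names = TRUE)

roundtrip <- read.csv(path, row.names = 1)  # first column holds the row names
dim(roundtrip)             # 6 rows, 11 columns
file.exists(path)          # TRUE

unlink(path)               # remove the temporary file
```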

# myData is a data frame made from the raw csv data
myData = read.csv("Sales.csv") 
head(myData, n = 20) # Display only the first 20 rows of the data frame

##    Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 1            FDW58      20.750          Low Fat     0.007564836
## 2            FDW14       8.300              reg     0.038427677
## 3            NCN55      14.600          Low Fat     0.099574908
## 4            FDQ58       7.315          Low Fat     0.015388393
## 5            FDY38          NA          Regular     0.118599314
## 6            FDH56       9.800          Regular     0.063817206
## 7            FDL48      19.350          Regular     0.082601537
## 8            FDC48          NA          Low Fat     0.015782495
## 9            FDN33       6.305          Regular     0.123365446
## 10           FDA36       5.985          Low Fat     0.005698435
## 11           FDT44      16.600          Low Fat     0.103569075
## 12           FDQ56       6.590          Low Fat     0.105811470
## 13           NCC54          NA          Low Fat     0.171079215
## 14           FDU11       4.785          Low Fat     0.092737611
## 15           DRL59      16.750               LF     0.021206464
## 16           FDM24       6.135          Regular     0.079450700
## 17           FDI57      19.850          Low Fat     0.054135210
## 18           DRC12      17.850          Low Fat     0.037980963
## 19           NCM42          NA          Low Fat     0.028184344
## 20           FDA46      13.600          Low Fat     0.196897637
##                Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year
## 1            Snack Foods 107.8622            OUT049                      1999
## 2                  Dairy  87.3198            OUT017                      2007
## 3                 Others 241.7538            OUT010                      1998
## 4            Snack Foods 155.0340            OUT017                      2007
## 5                  Dairy 234.2300            OUT027                      1985
## 6  Fruits and Vegetables 117.1492            OUT046                      1997
## 7           Baking Goods  50.1034            OUT018                      2009
## 8           Baking Goods  81.0592            OUT027                      1985
## 9            Snack Foods  95.7436            OUT045                      2002
## 10          Baking Goods 186.8924            OUT017                      2007
## 11 Fruits and Vegetables 118.3466            OUT017                      2007
## 12 Fruits and Vegetables  85.3908            OUT045                      2002
## 13    Health and Hygiene 240.4196            OUT019                      1985
## 14                Breads 122.3098            OUT049                      1999
## 15           Hard Drinks  52.0298            OUT013                      1987
## 16          Baking Goods 151.6366            OUT049                      1999
## 17               Seafood 198.7768            OUT045                      2002
## 18           Soft Drinks 192.2188            OUT018                      2009
## 19             Household 109.6912            OUT027                      1985
## 20           Snack Foods 193.7136            OUT010                      1998
##    Outlet_Size Outlet_Location_Type       Outlet_Type
## 1       Medium               Tier 1 Supermarket Type1
## 2                            Tier 2 Supermarket Type1
## 3                            Tier 3     Grocery Store
## 4                            Tier 2 Supermarket Type1
## 5       Medium               Tier 3 Supermarket Type3
## 6        Small               Tier 1 Supermarket Type1
## 7       Medium               Tier 3 Supermarket Type2
## 8       Medium               Tier 3 Supermarket Type3
## 9                            Tier 2 Supermarket Type1
## 10                           Tier 2 Supermarket Type1
## 11                           Tier 2 Supermarket Type1
## 12                           Tier 2 Supermarket Type1
## 13       Small               Tier 1     Grocery Store
## 14      Medium               Tier 1 Supermarket Type1
## 15        High               Tier 3 Supermarket Type1
## 16      Medium               Tier 1 Supermarket Type1
## 17                           Tier 2 Supermarket Type1
## 18      Medium               Tier 3 Supermarket Type2
## 19      Medium               Tier 3 Supermarket Type3
## 20                           Tier 3     Grocery Store
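
Writing works the same way in the other direction. Here is a small sketch of a write-then-read round trip. The data frame is made up for illustration (it is not the Sales data), and a temporary file is used so that nothing is left behind in your working directory.

```r
# A small, made-up data frame used only for illustration
scores <- data.frame(
  name  = c("Ann", "Bob", "Cy"),
  score = c(90.5, 78, 85),
  stringsAsFactors = FALSE
)

# Write it to a temporary csv file; row.names = FALSE avoids an extra column
out_file <- tempfile(fileext = ".csv")
write.csv(scores, file = out_file, row.names = FALSE)

# Read it back and compare with the original
scores_back <- read.csv(out_file, stringsAsFactors = FALSE)
print(scores_back)
all.equal(scores, scores_back)
```

Without row.names = FALSE, write.csv() also writes the row names, which come back as an extra first column when the file is read again.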

# The following reads yahoo finance Facebook stock prices remotely
url = "https://query1.finance.yahoo.com/v7/finance/download/FB?period1=1577340844&period2=1608963244&interval=1d&events=history&includeAdjustedClose=true"

FB = read.csv(file = url)

head(FB, n = 15)

##          Date   Open   High    Low  Close Adj.Close   Volume
## 1  2019-12-26 205.57 207.82 205.31 207.79    207.79  9350700
## 2  2019-12-27 208.67 208.93 206.59 208.10    208.10 10284200
## 3  2019-12-30 207.86 207.90 203.90 204.41    204.41 10524300
## 4  2019-12-31 204.00 205.56 203.60 205.25    205.25  8953500
## 5  2020-01-02 206.75 209.79 206.27 209.78    209.78 12077100
## 6  2020-01-03 207.21 210.40 206.95 208.67    208.67 11188400
## 7  2020-01-06 206.70 212.78 206.52 212.60    212.60 17058900
## 8  2020-01-07 212.82 214.58 211.75 213.06    213.06 14912400
## 9  2020-01-08 213.00 216.24 212.61 215.22    215.22 13475000
## 10 2020-01-09 217.54 218.38 216.28 218.30    218.30 12642800
## 11 2020-01-10 219.20 219.88 217.42 218.06    218.06 12119400
## 12 2020-01-13 219.60 221.97 219.21 221.91    221.91 14463400
## 13 2020-01-14 221.61 222.38 218.63 219.06    219.06 13288900
## 14 2020-01-15 220.61 221.68 220.14 221.15    221.15 10036500
## 15 2020-01-16 222.57 222.63 220.39 221.77    221.77 10015300

# The following uses the fread() function to read the same Facebook data. f = fast and friendly
library(data.table) # Load the package first

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

FB2 = fread(input = url)

head(FB2, n = 15)

##           Date   Open   High    Low  Close Adj Close   Volume
##  1: 2019-12-26 205.57 207.82 205.31 207.79    207.79  9350700
##  2: 2019-12-27 208.67 208.93 206.59 208.10    208.10 10284200
##  3: 2019-12-30 207.86 207.90 203.90 204.41    204.41 10524300
##  4: 2019-12-31 204.00 205.56 203.60 205.25    205.25  8953500
##  5: 2020-01-02 206.75 209.79 206.27 209.78    209.78 12077100
##  6: 2020-01-03 207.21 210.40 206.95 208.67    208.67 11188400
##  7: 2020-01-06 206.70 212.78 206.52 212.60    212.60 17058900
##  8: 2020-01-07 212.82 214.58 211.75 213.06    213.06 14912400
##  9: 2020-01-08 213.00 216.24 212.61 215.22    215.22 13475000
## 10: 2020-01-09 217.54 218.38 216.28 218.30    218.30 12642800
## 11: 2020-01-10 219.20 219.88 217.42 218.06    218.06 12119400
## 12: 2020-01-13 219.60 221.97 219.21 221.91    221.91 14463400
## 13: 2020-01-14 221.61 222.38 218.63 219.06    219.06 13288900
## 14: 2020-01-15 220.61 221.68 220.14 221.15    221.15 10036500
## 15: 2020-01-16 222.57 222.63 220.39 221.77    221.77 10015300
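
Notice that read.csv() printed the column name as "Adj.Close" while fread() kept "Adj Close". This is because read.csv() converts column names into syntactically valid R names by default (its check.names argument). A small sketch with made-up inline text, so that no download is needed:

```r
# Two lines of csv text typed inline (illustrative values)
csv_text <- "Date,Adj Close\n2019-12-26,207.79\n"

with_check    <- read.csv(text = csv_text)                      # default: check.names = TRUE
without_check <- read.csv(text = csv_text, check.names = FALSE) # keep names as they are

names(with_check)     # "Date" "Adj.Close"
names(without_check)  # "Date" "Adj Close"
```

A name containing a space must be wrapped in backticks when used with $, for example without_check$`Adj Close`.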

Data Visualization Using the “ggplot2” Package

A cheatsheet of ggplot2: https://rstudio.com/wp-content/uploads/2015/04/ggplot2-cheatsheet.pdf

You will need to practice the examples on your own in order to save the class time for other topics!

The structure of the code that creates a ggplot object in R is:

ggplot(data = ?, mapping = aes(x = ?, y = ?, color = ?, fill = ?, …)) + geom_*()

where * can be “line”, “point”, “bar”, “boxplot”, “density”, “dotplot”, “histogram”, “hline”, “vline”, “segment”, “text”, “smooth”, …
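
As a quick illustration of this template, here is one way to fill in the question marks using the built-in mtcars data. The choice of variables is arbitrary.

```r
library(ggplot2)

# data = mtcars; x = weight, y = miles per gallon; color = number of cylinders
p <- ggplot(data = mtcars,
            mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()   # the "*" in geom_*() is "point" here

p  # Printing the object draws the plot
```

Swapping geom_point() for another geom, such as geom_smooth(), changes how the same mapping is drawn.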

y = c(34, 56, 61, 78, 84, 92, 100, 120, 125)
x = c("1990-05-01", "1990-05-02", "1990-05-03", "1990-05-04", "1990-05-05", "1990-05-06", "1990-05-07", "1990-05-08", "1990-05-09")
dx = as.Date(x)

df = data.frame(x, y, dx)

library(ggplot2)

# Line plots
ggplot(data = df, mapping = aes(x=dx, y=y)) +
  geom_line(size = 3, color = "yellow", linetype = "solid") +
  scale_y_continuous(limits = c(40, 150)) + # Restrict the y-scale; points outside the limits (here y = 34) are dropped
  labs(title = "Some Line Plot", x = "Date", y = "Price", subtitle = "A First Example on ggplot", caption = "Made by XYZ") +
  theme(plot.title = element_text(hjust = 0.5, size = 20, color = "red", face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 10, color = "blue", face = "italic"),
        plot.caption = element_text(hjust = 1, size = 8, color = "green"),
        axis.text.x=element_text(size=12, color = "red"),
        axis.text.y=element_text(size=25, color = "pink", face = "bold"),
        axis.title.x=element_text(size=14,face="bold"),
        axis.title.y=element_text(size=30,face="bold"),
        plot.background = element_rect(fill = "lightyellow",
                                colour = "lightblue",
                                size = 0.5, linetype = "dashed"),
        panel.background = element_rect(fill = "lightgray",
                                colour = "lightblue",
                                size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "purple"), 
        panel.grid.minor = element_line(size = 0.25, linetype = 'dotted',
                                colour = "violet")
  )

## Warning: Removed 1 row(s) containing missing values (geom_path).
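
The warning appears because the first value, 34, lies below the lower limit of 40, so setting limits on the scale removes that point before the line is drawn. If the goal is only to zoom in without discarding data, coord_cartesian() is an alternative. A minimal sketch with the same nine values (the data frame and column names here are chosen for illustration):

```r
library(ggplot2)

prices <- data.frame(
  day   = as.Date("1990-05-01") + 0:8,
  price = c(34, 56, 61, 78, 84, 92, 100, 120, 125)
)

# Limits on the scale: the point at 34 is outside [40, 150] and is removed
p_scale <- ggplot(prices, aes(x = day, y = price)) +
  geom_line() +
  scale_y_continuous(limits = c(40, 150))

# Limits on the coordinate system: the view is zoomed, all points are kept
p_zoom <- ggplot(prices, aes(x = day, y = price)) +
  geom_line() +
  coord_cartesian(ylim = c(40, 150))

p_scale
p_zoom
```

In the second plot the line still starts from the first date and simply runs off the bottom of the panel.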

# Barplots
ggplot(data = iris, mapping = aes(x = Species, fill = Species)) + 
  geom_bar() +
  theme(legend.position = "none")

# Density plot
ggplot(iris, aes(x = Sepal.Length, color = Species)) + 
  geom_density() +
  labs(color = "Species") + 
  coord_flip()

ggplot(mtcars, aes(x = mpg, color = factor(cyl))) + 
  geom_density() +
  labs(color = "Cyl") + 
  coord_flip()

# Violin plot
ggplot(iris, aes(y = Sepal.Length, x = Species)) + 
  geom_violin() +
  labs(y = "Sepal Length")

titanic = as.data.frame(Titanic)

ggplot(titanic, aes(x = Survived, y = Freq, fill = Class)) +
  geom_col(position = "dodge") # For aggregated (grouped) data; same as geom_bar(stat = "identity", position = "dodge")

ggplot(titanic, aes(x = Survived, y = Freq, fill = Class)) +
  geom_col(position = "stack") +
  labs(fill = "Classes")

# Faceting
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(cyl ~ .)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(. ~ cyl) # The "." at left side can be omitted

# When there are many categories for the categorical variable
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(. ~ manufacturer)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(cyl ~ class)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(cyl ~ class)

# Making your plot interactive using the "plotly" package
# First store the ggplot object in a variable, called p here.
p = ggplot(mpg, aes(displ, hwy)) +
  geom_point()

library(plotly) # Must load the package first

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:igraph':
## 
##     groups

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

ggplotly(p)

# Scatterplot matrices: base R first, then better ones using the "GGally" package
plot(mtcars) # scatterplot matrix

cor(mtcars) # correlation matrix

##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000

# Two in one: scatterplots and correlations in a single plot
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

mtcars %>% GGally::ggpairs()
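
The full correlation matrix is hard to read at a glance. As a small sketch, one row can be pulled out and ranked by absolute value, for example to see which variables are most strongly related to mpg:

```r
# Correlations of every variable with mpg, dropping mpg itself
cor_mpg <- cor(mtcars)["mpg", ]
cor_mpg <- cor_mpg[names(cor_mpg) != "mpg"]

# Rank by strength (absolute value), strongest first
ranked <- cor_mpg[order(abs(cor_mpg), decreasing = TRUE)]
round(head(ranked, 3), 3)
##     wt    cyl   disp 
## -0.868 -0.852 -0.848
```

These values agree with the first row of the cor(mtcars) output: weight, number of cylinders, and displacement are all strongly negatively correlated with mpg.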

Create a choropleth map

A choropleth map (in Greek ‘choro’ means ‘area/region’ and ‘pletho’ means ‘multitude’) is a type of thematic map in which a set of pre-defined areas is colored or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area, such as population density or per-capita income. (From Wikipedia)

Violent Crime Rates by US State

The following code is from the R documentation.

library(dplyr)
library(highcharter) 

data("USArrests", package = "datasets")
data("usgeojson")

USArrests$state <- rownames(USArrests)
# Alternatively, USArrests <- mutate(USArrests, state = rownames(USArrests))

highchart() %>%
  hc_title(text = "Violent Crime Rates by US State") %>%
  hc_subtitle(text = "Source: USArrests data") %>%
  hc_add_series_map(
    map = usgeojson, 
    df = USArrests,
    name = "Murder arrests (per 100,000)",
    value = "Murder", joinBy = c("woename", "state"),
    dataLabels = list(
      enabled = TRUE,
      format = "{point.properties.postalcode}"
    )
  ) %>%
  hc_colorAxis(stops = color_stops()) %>%
  hc_legend(valueDecimals = 0) %>% # No "%" suffix: the values are rates per 100,000, not percentages
  hc_mapNavigation(enabled = TRUE)

World Life Expectancy

Refer to the page https://www.datanovia.com/en/lessons/highchart-interactive-world-map-in-r/ and use the code to practice.

# Load required R packages
#library(tidyverse)
# library(highcharter) 

# Retrieve life expectancy data for the year 2019
library(dplyr)
life.exp <- read.csv("/Users/home/Documents/Zhang/Stat415.515.615/lifeExpectancy.csv")            
life.exp <- life.exp %>%
  filter(Year == 2019) 
head(life.exp)

##           Entity Code Year Life.expectancy
## 1    Afghanistan  AFG 2019        64.83300
## 2         Africa      2019        63.17000
## 3        Albania  ALB 2019        78.57300
## 4        Algeria  DZA 2019        76.88000
## 5 American Samoa  ASM 2019        73.74500
## 6       Americas      2019        76.83539

# Load the world Map data
data(worldgeojson, package = "highcharter")

hc <- highchart() %>%
  hc_add_series_map(
    worldgeojson, life.exp, value = "Life.expectancy", joinBy = c('iso3', "Code"),
    name = "LifeExpectancy"
    )  %>% 
  hc_colorAxis(stops = color_stops()) %>% 
  hc_title(text = "World Map") %>% 
  hc_subtitle(text = "Life Expectancy in 2019")

hc
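
The map is drawn by matching the "iso3" field of the map data with the "Code" column of the data frame. In the head() output above, regional aggregates such as Africa and Americas have an empty Code, so presumably they cannot be matched to any country on the map. The sketch below rebuilds those six rows by hand (copied from the output above, not read from the file) and separates the rows that have a code from those that do not.

```r
# The six rows shown by head(life.exp), typed in by hand for illustration
life_exp_sample <- data.frame(
  Entity = c("Afghanistan", "Africa", "Albania", "Algeria",
             "American Samoa", "Americas"),
  Code = c("AFG", "", "ALB", "DZA", "ASM", ""),
  Year = 2019,
  Life.expectancy = c(64.833, 63.170, 78.573, 76.880, 73.745, 76.83539),
  stringsAsFactors = FALSE
)

# TRUE where a non-empty country code is present
has_code <- nzchar(life_exp_sample$Code)

life_exp_sample$Entity[!has_code]   # "Africa" "Americas"

# Keep only the rows that can be joined to the map by country code
countries_only <- life_exp_sample[has_code, ]
nrow(countries_only)                # 4
```

Checking the join key like this before plotting makes it easier to explain why some rows of the data do not show up on the map.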

Create Heat Maps

A heat map (or heatmap) is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space. (From Wikipedia)

Refer to: https://www.r-graph-gallery.com/215-the-heatmap-function.html

Heatmaps that display raw values or values after scaling

# The mtcars dataset:
data <- as.matrix(mtcars) # The heatmap function takes as input a matrix.

# Default Heatmap
heatmap(data)

The heatmap just generated is not very insightful: all the variation is absorbed by the “hp” and “disp” variables, which have very high values compared to the others. We need to normalize the data.

Normalization is done with the scale argument of the heatmap() function. It can be applied to rows (the default) or to columns. Here the column option is chosen, since the variables in the columns are measured on very different scales.

# Use 'scale' to normalize
heatmap(data, scale="column")

# No dendrogram or reordering for either rows or columns
heatmap(data, Colv = NA, Rowv = NA, scale="column")

A note: to scale a dataset on columns, use the scale() function in R, as shown below.

scaled.mtcars = scale(mtcars)

# Check
apply(scaled.mtcars, 2, mean) # All column means are now basically zero

##           mpg           cyl          disp            hp          drat 
##  7.112366e-17 -1.474515e-17 -9.084937e-17  1.040834e-17 -2.918672e-16 
##            wt          qsec            vs            am          gear 
##  4.681043e-17  5.299580e-16  6.938894e-18  4.510281e-17 -3.469447e-18 
##          carb 
##  3.165870e-17

apply(scaled.mtcars, 2, sd) # All column standard deviations are now one

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    1    1    1    1    1    1    1    1    1    1    1
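
To see what scale() is doing, the same standardization can be done by hand for one column: subtract the column mean and divide by the column standard deviation. A minimal sketch for the mpg column:

```r
# Standardize mpg by hand: (value - mean) / standard deviation
mpg_z <- (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)

# Compare with the mpg column produced by scale()
scaled <- scale(mtcars)
all.equal(mpg_z, as.numeric(scaled[, "mpg"]))
## [1] TRUE
```

The result of scale() is a matrix, so it can be passed directly to heatmap().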

Heatmaps that display correlations and missing values

For correlation,

# No dendrogram at all
heatmap(cor(mtcars), Colv = NA, Rowv = NA, scale = "none") # Correlations are already on a common scale, so turn off the default row scaling

library(gplots)

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

heatmap.2(cor(mtcars), Colv = FALSE, Rowv = FALSE, cellnote = round(cor(mtcars), 2), notecol = "black", dendrogram = "none", trace = "none", key = FALSE)

For missing values,

w = c(NA, 7, NA, 34, 12, NA, 44, 21, 26, 56, NA, 45, 34, 12)
x = c( 9, 12, 23, 31, NA, 24, 13, 26, NA, 43, NA, NA, 34, NA)
y = c(32, 11, 7, NA, 8, 2, 3, NA, 2, 8, 12, 21, 54, NA)
z = c(41,  23, NA, 51, 52, 43, NA, 31, NA, 34, 31, NA, 33, NA)
df = data.frame(w, x,y,z)

# What is the difference between the results of the following 2 lines of code?
is.na(x)

##  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE
## [13] FALSE  TRUE

is.na(x)*1

##  [1] 0 0 0 0 1 0 0 0 1 0 1 1 0 1

# Generate a heatmap that displays missing values
heatmap(is.na(df)*1, Colv = NA, Rowv = NA, scale = "none") # scale = "none" keeps the 0/1 values as they are
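
The heatmap gives a picture of where the missing values are. The same 0/1 idea can also be summarized in numbers. A small sketch using the same four vectors (the data frame is named na_df here only to keep the example self-contained):

```r
w = c(NA, 7, NA, 34, 12, NA, 44, 21, 26, 56, NA, 45, 34, 12)
x = c( 9, 12, 23, 31, NA, 24, 13, 26, NA, 43, NA, NA, 34, NA)
y = c(32, 11, 7, NA, 8, 2, 3, NA, 2, 8, 12, 21, 54, NA)
z = c(41, 23, NA, 51, 52, 43, NA, 31, NA, 34, 31, NA, 33, NA)
na_df = data.frame(w, x, y, z)

# Number of missing values in each column (TRUE counts as 1)
colSums(is.na(na_df))
## w x y z 
## 4 5 3 5

# Number of rows with no missing value at all
sum(complete.cases(na_df))
## [1] 3
```

Only rows 2, 10, and 13 are complete, which matches the rows that have no missing-value cells in the heatmap.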

Create Treemaps

A treemap is a space-filling visualization of hierarchical structures. The map is a set of nested rectangles. Each group is represented by a rectangle.

Here is a blog about treemaps: https://www.r-bloggers.com/2018/09/simple-steps-to-create-treemap-in-r/

# The code is from R documentation.

library(treemap)

data(GNI2014)

treemap(
  dtf = GNI2014,
  index=c("continent", "iso3"),
  vSize="population",
  vColor="GNI",
  type="value",
  format.legend = list(scientific = FALSE, big.mark = " "))

Collect Data by Web Scraping with Octoparse

https://www.youtube.com/watch?v=tv9xBVEM5jQ

https://www.youtube.com/watch?v=I2GgfDl69No

https://www.youtube.com/watch?v=d84VNDgTkjM

https://www.youtube.com/watch?v=66clSCQC6Ls

https://www.youtube.com/watch?v=j_JWaMnsXWQ