knitr::opts_chunk$set(echo = TRUE)
library(ggplot2) # It is a good idea to load your packages in this first code chunk
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
R is an interpreted language. When you enter expressions into the R console (or run an R script in batch mode), a program within the R system, called the interpreter, executes the actual code that you wrote. Unlike C, C++, and Java, there is no need to compile your programs into an object language. Other examples of interpreted languages are Common Lisp, Perl, and JavaScript. (R in a Nutshell, 2nd Edition, by Joseph Adler)
A good reference for the R programming language is https://www.tutorialspoint.com/r/index.htm.
When running code written in R, some packages might be needed. These packages must first be installed in one of two ways:
Install from the console by issuing
install.packages("packageName") # The package name goes in double or single quotes
Or go to the lower-right pane of the RStudio window, click the “Packages” tab and then “Install”, type the name of the package you want to install, and click the “Install” button. The console will show the progress of the installation.
A package only needs to be installed once. To remove a package from your computer, go to the lower-right pane again, find the package in the list of packages, and click the “x” at the right end of its row.
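The same steps can be scripted from the console. The sketch below checks whether a package is available before installing it. The helper name `is_installed` is hypothetical, and the install and remove calls are commented out so that running the block changes nothing on your computer.

```r
# Hypothetical helper: TRUE if the named package can be found on this computer
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")   # "stats" ships with R, so this should be TRUE

# Install "igraph" only if it is missing (uncomment to run; needs internet access)
# if (!is_installed("igraph")) install.packages("igraph")

# Remove a package from the console instead of using the "Packages" tab
# remove.packages("igraph")
```

Checking first avoids reinstalling a package every time the script is run.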
When your R code uses a function or a dataset from a particular package, you need to load the package by issuing
library(packageName) # The package name can be given with or without quotes
print("Please install the 'igraph' package.")
## [1] "Please install the 'igraph' package."
library(igraph) # Load the package "igraph"
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
# A graph with directed edges: 1->2 2->4 3->1 2->1 3->2 4->1 4->3
g1 <- graph(edges=c(1,2, 2,4, 3,1, 2,1, 3,2, 4,1, 4,3), n=4, directed=TRUE)
plot(g1) # A plot of the network
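To run R code or other code within the RStudio environment, the code must be in a code chunk. An R code chunk begins with a line of three backticks followed by {r} and ends with a line of three backticks. Your lines of code go in between, for example: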
# Your many lines of code are in between
# x=3
# y=4
To add a code chunk, a shortcut is to click the “Insert” button at the top of the upper-left pane in RStudio, and then choose the appropriate programming language.
Comments in your code help both you and your readers. A comment is prefixed by one or more # characters; anything after # on the same line is treated as a comment. To comment out multiple lines of code in RStudio, highlight those lines and press Cmd+Shift+C on a Mac (Ctrl+Shift+C on Windows or Linux).
Objects are instances of classes. Everything in R is an object of some class, and each class has a certain structure. Basic data structures in R include vectors, matrices, data frames, lists, and factors.
Vectors are one-dimensional arrays whose elements all have the same type, for example all numeric values or all character strings. If one element of a vector is a string, the other elements are converted to strings automatically.
x = 4 # This defines a scalar, which is a numeric vector of length 1
print(x) # This prints x. The name "print" can be omitted.
## [1] 4
y = c(2, 5, 9, 10) # This defines a numeric vector of length 4. The 4 elements are 2, 5, 9, and 10.
z = 1:10 # This defines a patterned numeric vector of length 10. The elements are 1, 2, ..., 10.
t = seq(3, 100, by = 10) # An arithmetic sequence (vector) with initial term 3 and increment 10.
a = y^2 # This defines a numeric vector whose elements are the squares of the elements of y.
b = log(y) # Natural log transformation of y, stored in b
u = "Hello World!" # This defines a character vector of length 1.
v = c("David", "Mike", "Rich") # This defines a character vector of length 3.
w = c("Haha", "Hehe", 5, 10) # The elements 5 and 10 are converted to strings automatically.
print(w)
## [1] "Haha" "Hehe" "5" "10"
class(y) # This shows the class of the R object y.
## [1] "numeric"
class(w)
## [1] "character"
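The automatic conversion described above follows a fixed order: logical values become numbers, and numbers become strings. The short sketch below illustrates this together with element-by-element arithmetic. The object names are chosen only for this illustration.

```r
# Mixing types in c() converts everything to the most general type:
# logical -> numeric -> character
p <- c(TRUE, FALSE, 3)      # logicals become numbers: 1 0 3
q <- c(1.5, "two", TRUE)    # everything becomes a string
class(p)                    # "numeric"
class(q)                    # "character"

# Arithmetic on vectors works element by element
y <- c(2, 5, 9, 10)
y * 2                       # 4 10 18 20
y + c(1, 0, 1, 0)           # 3 5 10 10

# Converting strings back to numbers: non-numeric strings become NA
w <- c("Haha", "Hehe", "5", "10")
w_num <- suppressWarnings(as.numeric(w))
w_num                       # NA NA 5 10
```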
Matrices are 2-dimensional arrays. A matrix can only hold elements that are either all numeric values or all characters, but a data frame can hold numeric values in some columns and characters in other columns.
M = matrix(1:20, nrow = 4, byrow = TRUE) # This defines a 4-by-5 matrix whose elements are 1 through 20, filled row by row.
M
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
## [3,] 11 12 13 14 15
## [4,] 16 17 18 19 20
dim(M) # This displays the dimension of matrix M.
## [1] 4 5
rownames(M) = c("Row1", "Row2", "Row3", "Row4")
colnames(M) = c("Col1", "Col2", "Col3", "Col4", "Col5")
dimnames(M) # This displays both row names and column names
## [[1]]
## [1] "Row1" "Row2" "Row3" "Row4"
##
## [[2]]
## [1] "Col1" "Col2" "Col3" "Col4" "Col5"
M
## Col1 Col2 Col3 Col4 Col5
## Row1 1 2 3 4 5
## Row2 6 7 8 9 10
## Row3 11 12 13 14 15
## Row4 16 17 18 19 20
D = data.frame(y, a, b, grade = c("A", "B", "B+", "A-") ) # This defines a data frame.
dimnames(D) # This gives both row names and column names of the data frame D
## [[1]]
## [1] "1" "2" "3" "4"
##
## [[2]]
## [1] "y" "a" "b" "grade"
rownames(D) = c("Jenny", "Henny", "Bob", "Tod") # Change row names of D
colnames(D) = c("Y", "A", "B", "Grade") # Change column names. Equivalently, you can use the function "names".
D
## Y A B Grade
## Jenny 2 4 0.6931472 A
## Henny 5 25 1.6094379 B
## Bob 9 81 2.1972246 B+
## Tod 10 100 2.3025851 A-
class(M)
## [1] "matrix" "array"
class(D)
## [1] "data.frame"
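The difference between a matrix and a data frame shows up as soon as numbers and strings are combined. The following minimal sketch uses made-up values; the names `mixedMatrix` and `mixedFrame` are only for illustration.

```r
num   <- c(2, 5, 9, 10)
grade <- c("A", "B", "B+", "A-")

# cbind() builds a matrix, so the numbers are converted to strings
mixedMatrix <- cbind(num, grade)
class(mixedMatrix[, "num"])      # "character"

# data.frame() keeps one type per column
mixedFrame <- data.frame(num, grade, stringsAsFactors = FALSE)
sapply(mixedFrame, class)        # num is "numeric", grade is "character"
mean(mixedFrame$num)             # 6.5: arithmetic still works on the numeric column
```

This is why data sets with mixed column types are normally stored as data frames rather than matrices.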
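A list in R can hold elements of different kinds, such as vectors, matrices, data frames, and even other lists. Lists are especially important because the output of a model-fitting function is typically returned as a list.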
myList = list(A=1:5, B = matrix(1:8, nrow = 2, byrow = TRUE), C = "Hello!", D = data.frame(x=1:4, y = 9:6))
print(myList)
## $A
## [1] 1 2 3 4 5
##
## $B
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
##
## $C
## [1] "Hello!"
##
## $D
## x y
## 1 1 9
## 2 2 8
## 3 3 7
## 4 4 6
class(myList)
## [1] "list"
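To see why lists matter for model fitting, the sketch below fits a simple linear regression with the built-in `lm` function on the built-in `mtcars` data and inspects the result. The choice of model (mpg against wt) is only an illustration.

```r
fit <- lm(mpg ~ wt, data = mtcars)   # simple linear regression on a built-in data set
is.list(fit)                         # TRUE: the fitted model is stored as a list
names(fit)                           # component names, such as "coefficients" and "residuals"
fit$coefficients                     # extract one component by name
length(fit$residuals)                # 32: one residual per row of mtcars
```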
To encode a vector as a factor with certain levels, use the R function “factor”. The levels should be values that occur in the vector; any element of the vector that is not among the levels becomes NA. Labels are optional. When a vector is converted to a factor, it becomes a categorical variable. Factors are useful when you want to display character vectors in a non-alphabetical order (that is, an order you choose).
v = c(1, 1, 3, 0, 1, 0, 3, 4, 4, 1, 2, 0, 1, 2)
# The following encodes the vector v as a factor with levels 0 through 4.
x = factor(v, levels = c(0, 1, 2, 3, 4), labels = c("zero", "one", "two", "three", "four"))
y = factor(v) # By default, the levels are the distinct values in sorted order
z = factor(v, levels = 4:0) # The order of levels can be set as desired.
levels(x) # This displays the levels of the factor x
## [1] "zero" "one" "two" "three" "four"
levels(z)
## [1] "4" "3" "2" "1" "0"
class(v)
## [1] "numeric"
class(x)
## [1] "factor"
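The point about non-alphabetical order is easiest to see with a character vector. The sketch below uses a made-up vector of sizes.

```r
sizes <- c("medium", "small", "large", "small")

f_default <- factor(sizes)   # levels are sorted alphabetically
levels(f_default)            # "large" "medium" "small"

f_ordered <- factor(sizes, levels = c("small", "medium", "large"))
levels(f_ordered)            # "small" "medium" "large"
table(f_ordered)             # counts are reported in the order of the levels
as.integer(f_ordered)        # 2 1 3 1: the internal integer codes

# A value that is not among the levels becomes NA
f_missing <- factor(c("small", "huge"), levels = c("small", "medium", "large"))
is.na(f_missing)             # FALSE TRUE
```

Plots and tables generally follow the order of the levels, so setting the levels explicitly controls the display order.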
You can pull out parts of a data structure with subsetting operations. There are three operators that can be used to extract subsets of R objects: single brackets [ ], double brackets [[ ]], and the dollar sign $.
x = (1:8)/10
x[3] # Extract the third element from vector x
## [1] 0.3
x[2:5] # Extract 2nd to 5th elements as a vector
## [1] 0.2 0.3 0.4 0.5
M = matrix(1:35, nrow = 5, byrow = TRUE)
M[4] # Extract the 4th element of M, counting down the columns (column-major order)
## [1] 22
M[4, ] # Extract the 4th row as a vector
## [1] 22 23 24 25 26 27 28
M[, 4] # Extract the 4th column as a vector
## [1] 4 11 18 25 32
M[2, 5] # Extract the element at the intersection of second row and 5th column.
## [1] 12
M[2:4, ] # Extract the second to 4th rows
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 8 9 10 11 12 13 14
## [2,] 15 16 17 18 19 20 21
## [3,] 22 23 24 25 26 27 28
M[, 2:4] # Extract the second to 4th columns
## [,1] [,2] [,3]
## [1,] 2 3 4
## [2,] 9 10 11
## [3,] 16 17 18
## [4,] 23 24 25
## [5,] 30 31 32
y = data.frame(a=1:10, b = 5:14)
y[2] # Extract the second column as a new one-column data frame (rarely what you want).
## b
## 1 5
## 2 6
## 3 7
## 4 8
## 5 9
## 6 10
## 7 11
## 8 12
## 9 13
## 10 14
y[[2]] # Extract the second column as a vector.
## [1] 5 6 7 8 9 10 11 12 13 14
y[2, ] # Extract the second row as a one-row data frame.
## a b
## 2 2 6
y[, 2] # Same as y[[2]]
## [1] 5 6 7 8 9 10 11 12 13 14
y$b # Extract the "b" column as a vector. Dollar is good, but not necessary!
## [1] 5 6 7 8 9 10 11 12 13 14
y$"b" # Same as y$b
## [1] 5 6 7 8 9 10 11 12 13 14
y["b"] # Same as y[2], a new one-column data frame.
## b
## 1 5
## 2 6
## 3 7
## 4 8
## 5 9
## 6 10
## 7 11
## 8 12
## 9 13
## 10 14
y[,"b"] # Same as y[, 2]
## [1] 5 6 7 8 9 10 11 12 13 14
y[["b"]] # Same as y[[2]], a vector
## [1] 5 6 7 8 9 10 11 12 13 14
y[3:6, ] # Extract the 3rd to 6th rows as a new data frame
## a b
## 3 3 7
## 4 4 8
## 5 5 9
## 6 6 10
myList = list(A=1:5, B = matrix(1:8, nrow = 2, byrow = TRUE), C = "Hello!", D = data.frame(x=1:4, y = 9:6))
myList[4] # Extract the 4th element as a new list with only one element D.
## $D
## x y
## 1 1 9
## 2 2 8
## 3 3 7
## 4 4 6
myList[[4]] # Extract the 4th element itself (here, the data frame D), not a one-element list
## x y
## 1 1 9
## 2 2 8
## 3 3 7
## 4 4 6
myList$D # Same as myList[[4]]
## x y
## 1 1 9
## 2 2 8
## 3 3 7
## 4 4 6
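A few more subsetting patterns come up often: negative indices, logical conditions, and keeping the matrix structure when a single column is extracted. The sketch below rebuilds small versions of the objects used above so that it runs on its own.

```r
x <- (1:8) / 10
x[-1]               # drop the first element
x[c(1, 3, 5)]       # pick elements by position: 0.1 0.3 0.5
x[x > 0.5]          # pick elements by a logical condition: 0.6 0.7 0.8

M <- matrix(1:35, nrow = 5, byrow = TRUE)
dim(M[, 4, drop = FALSE])   # 5 1: drop = FALSE keeps the result a matrix

myList <- list(A = 1:5, C = "Hello!")
class(myList["A"])    # "list": single brackets return a smaller list
class(myList[["A"]])  # "integer": double brackets return the element itself
```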
Frequently, we need to compare two R objects to check whether they are equal or whether one is greater than the other. A comparison returns a logical value, TRUE or FALSE.
x = 4
y = 5
z = (x > y)
print(z)
## [1] FALSE
w = (x <= y)
print(w)
## [1] TRUE
a = "abc"
b = "abC"
d = (a != b) # Is a not equal to b?
print(d)
## [1] TRUE
q = c(2, 9, 11, 45, 34, 8, 24, 15, 5, 7, 21)
r = 5
s = (q>r)
print(s)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
u = TRUE
v = "TRUE"
class(u)
## [1] "logical"
class(v)
## [1] "character"
D2 = mtcars[mtcars$cyl == 4, c(2, 7, 9)] # Rows of mtcars with cyl equal to 4, keeping columns 2, 7, and 9.
D2
## cyl qsec am
## Datsun 710 4 18.61 1
## Merc 240D 4 20.00 0
## Merc 230 4 22.90 0
## Fiat 128 4 19.47 1
## Honda Civic 4 18.52 1
## Toyota Corolla 4 19.90 1
## Toyota Corona 4 20.01 0
## Fiat X1-9 4 18.90 1
## Porsche 914-2 4 16.70 1
## Lotus Europa 4 16.90 1
## Volvo 142E 4 18.60 1
t = q[q>10] # a vector containing values that are greater than 10
print(t)
## [1] 11 45 34 24 15 21
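Comparisons can be combined, and exact equality of decimal numbers deserves some care. The following sketch uses the built-in `mtcars` data and the vector `q` from above; the name `manual4` is only for illustration.

```r
# Combine conditions with & (and), | (or), and ! (not)
manual4 <- mtcars[mtcars$cyl == 4 & mtcars$am == 1, c("mpg", "cyl", "am")]
nrow(manual4)                      # 8 four-cylinder cars with am equal to 1

q <- c(2, 9, 11, 45, 34, 8, 24, 15, 5, 7, 21)
which(q > 10)                      # positions of the values greater than 10: 3 4 5 7 8 11
q[q < 5 | q > 30]                  # 2 45 34

# Exact equality can be surprising with decimal numbers
0.1 + 0.2 == 0.3                   # FALSE, because of floating-point rounding
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE: equal up to a small tolerance
```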
x = 15
# 4 branches: The number line is divided into 4 intervals:
# (-infinity, 10), [10, 20), [20, 30), and [30, infinity)
if (x < 10) {
  y = 2*x - 3
} else if (x < 20) {
  y = 3*x + 4
} else if (x < 30) {
  y = 5*x - 12
} else {
  y = 10000
}
print(y)
## [1] 49
# 2 branches
States = c("MN", "FL", "IL", "CA")
state = "IL"
if (state %in% States) {
  message = "Found it!"
} else {
  message = "Not found."
}
print(message)
## [1] "Found it!"
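An `if` statement tests a single TRUE or FALSE value. To apply the same four-branch rule to a whole vector at once, one option is the built-in `ifelse` function. The sketch below is one way to do this; the function name `piecewise` is hypothetical.

```r
# Hypothetical helper that applies the four-branch rule to every element of x
piecewise <- function(x) {
  ifelse(x < 10, 2 * x - 3,
  ifelse(x < 20, 3 * x + 4,
  ifelse(x < 30, 5 * x - 12, 10000)))
}

piecewise(c(5, 15, 25, 40))   # 7 49 113 10000

# %in% is vectorized too: one TRUE/FALSE per value being looked up
States <- c("MN", "FL", "IL", "CA")
c("IL", "NY") %in% States     # TRUE FALSE
```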
A loop performs an operation repeatedly. Like many other programming languages, R has for loops and while loops.
## The following gives a way of calculating the sum of the first 100 natural numbers.
total = 0 # Initial value is 0; the name "total" avoids masking the built-in function sum()
for (k in 1:100) {
  total = total + k
}
print(total)
## [1] 5050
## The following gives another way of calculating the sum of the first 100 natural numbers.
total = 0
k = 1
while (k <= 100) {
  total = total + k
  k = k + 1
}
print(total)
## [1] 5050
## Or, we simply call a function to do the job
sum(1:100)
## [1] 5050
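When a loop has to store one result per iteration, it helps to create the result vector before the loop starts. The sketch below records the running total of the first 10 natural numbers and checks it against the built-in `cumsum` function. The names `running` and `total` are only for illustration.

```r
n <- 10
running <- numeric(n)          # create the result vector before the loop
total <- 0
for (k in seq_len(n)) {
  total <- total + k
  running[k] <- total          # running total after k terms
}
running                        # 1 3 6 10 15 21 28 36 45 55
all(running == cumsum(1:n))    # TRUE: the built-in function gives the same values
total == n * (n + 1) / 2       # TRUE: matches the closed-form formula
```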
A function in R is an object so the R interpreter is able to pass control to the function, along with arguments that may be necessary for the function to accomplish the actions. The function in turn performs its task and returns control to the interpreter as well as any result which may be stored in other objects. (From https://www.tutorialspoint.com/r/r_functions.htm)
Many built-in functions are available in R, and we have already used quite a few of them above. To check the details of a built-in function, type ?functionName in the R console.
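You can also define your own functions. The following minimal sketch defines a function with a default argument; the name `sum_of_powers` and its arguments are hypothetical and chosen only for this illustration.

```r
# Hypothetical function: the sum of the first n natural numbers, each raised to a power
sum_of_powers <- function(n, power = 1) {
  sum((1:n)^power)
}

sum_of_powers(100)              # 5050, the same as sum(1:100)
sum_of_powers(3, power = 2)     # 1 + 4 + 9 = 14
class(sum_of_powers)            # "function": a function is itself an object
```

The value of the last expression in the function body is returned to the caller.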
A few very useful built-in functions are demonstrated below.
# Create a vector of 10 zeros
x = numeric(10)
x
## [1] 0 0 0 0 0 0 0 0 0 0
x[4] = 1000 # Set the 4th element of the numeric vector to 1000
x
## [1] 0 0 0 1000 0 0 0 0 0 0
# Create a character vector of 10 empty strings
y = character(10)
print(mtcars) # "mtcars" is a data frame available from the "datasets" package, which is loaded by default
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
head(mtcars, n = 10) # Display only the first 10 rows of the data
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
nrow(mtcars) # Display the number of rows in the data
## [1] 32
names(mtcars) # Display the column names of the data
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
colnames(mtcars) # column names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
rownames(mtcars) # row names
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
dimnames(mtcars) # Both row and column names
## [[1]]
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
##
## [[2]]
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
str(mtcars) # Display the structure of the mtcars data frame
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
class(mtcars) # The class of the data
## [1] "data.frame"
summary(mtcars) # Summarize each column of the mtcars data frame
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
plot(mtcars) # Scatterplot matrix
# Add a new column to a data frame
D = mtcars # Create a copy
D$log.mpg = log(D$mpg)
D$sq.wt = D$wt^2
D
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## log.mpg sq.wt
## Mazda RX4 3.044522 6.864400
## Mazda RX4 Wag 3.044522 8.265625
## Datsun 710 3.126761 5.382400
## Hornet 4 Drive 3.063391 10.336225
## Hornet Sportabout 2.928524 11.833600
## Valiant 2.895912 11.971600
## Duster 360 2.660260 12.744900
## Merc 240D 3.194583 10.176100
## Merc 230 3.126761 9.922500
## Merc 280 2.954910 11.833600
## Merc 280C 2.879198 11.833600
## Merc 450SE 2.797281 16.564900
## Merc 450SL 2.850707 13.912900
## Merc 450SLC 2.721295 14.288400
## Cadillac Fleetwood 2.341806 27.562500
## Lincoln Continental 2.341806 29.419776
## Chrysler Imperial 2.687847 28.569025
## Fiat 128 3.478158 4.840000
## Honda Civic 3.414443 2.608225
## Toyota Corolla 3.523415 3.367225
## Toyota Corona 3.068053 6.076225
## Dodge Challenger 2.740840 12.390400
## AMC Javelin 2.721295 11.799225
## Camaro Z28 2.587764 14.745600
## Pontiac Firebird 2.954910 14.784025
## Fiat X1-9 3.306887 3.744225
## Porsche 914-2 3.258097 4.579600
## Lotus Europa 3.414443 2.289169
## Ford Pantera L 2.760010 10.048900
## Ferrari Dino 2.980619 7.672900
## Maserati Bora 2.708050 12.744900
## Volvo 142E 3.063391 7.728400
# Equivalently
library(dplyr)
D = mutate(mtcars,
           log.mpg = log(mpg),
           sq.wt = wt^2
)
D
## mpg cyl disp hp drat wt qsec vs am gear carb log.mpg sq.wt
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 3.044522 6.864400
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3.044522 8.265625
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3.126761 5.382400
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 3.063391 10.336225
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 2.928524 11.833600
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 2.895912 11.971600
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 2.660260 12.744900
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 3.194583 10.176100
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 3.126761 9.922500
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 2.954910 11.833600
## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 2.879198 11.833600
## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 2.797281 16.564900
## 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 2.850707 13.912900
## 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 2.721295 14.288400
## 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 2.341806 27.562500
## 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 2.341806 29.419776
## 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 2.687847 28.569025
## 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 3.478158 4.840000
## 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 3.414443 2.608225
## 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 3.523415 3.367225
## 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 3.068053 6.076225
## 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 2.740840 12.390400
## 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 2.721295 11.799225
## 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 2.587764 14.745600
## 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 2.954910 14.784025
## 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 3.306887 3.744225
## 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 3.258097 4.579600
## 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 3.414443 2.289169
## 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 2.760010 10.048900
## 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 2.980619 7.672900
## 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2.708050 12.744900
## 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 3.063391 7.728400
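Base R offers a similar shortcut that does not need dplyr: the `transform` function. The sketch below adds the same two columns. Unlike the mutate() output shown above, which prints row numbers, the result of transform() keeps the car names as row names.

```r
D <- transform(mtcars,
               log.mpg = log(mpg),
               sq.wt   = wt^2)
head(D[, c("mpg", "log.mpg", "wt", "sq.wt")], n = 3)
rownames(D)[1:3]                # "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
ncol(D)                         # 13: the 11 original columns plus the 2 new ones
```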
# Rename columns of a data frame
D = mtcars
D = rename(D,
           "mile per gallon" = mpg,
           "cylinder" = cyl,
           "horse power" = hp
)
D
## mile per gallon cylinder disp horse power drat wt qsec
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60
## vs am gear carb
## Mazda RX4 0 1 4 4
## Mazda RX4 Wag 0 1 4 4
## Datsun 710 1 1 4 1
## Hornet 4 Drive 1 0 3 1
## Hornet Sportabout 0 0 3 2
## Valiant 1 0 3 1
## Duster 360 0 0 3 4
## Merc 240D 1 0 4 2
## Merc 230 1 0 4 2
## Merc 280 1 0 4 4
## Merc 280C 1 0 4 4
## Merc 450SE 0 0 3 3
## Merc 450SL 0 0 3 3
## Merc 450SLC 0 0 3 3
## Cadillac Fleetwood 0 0 3 4
## Lincoln Continental 0 0 3 4
## Chrysler Imperial 0 0 3 4
## Fiat 128 1 1 4 1
## Honda Civic 1 1 4 2
## Toyota Corolla 1 1 4 1
## Toyota Corona 1 0 3 1
## Dodge Challenger 0 0 3 2
## AMC Javelin 0 0 3 2
## Camaro Z28 0 0 3 4
## Pontiac Firebird 0 0 3 2
## Fiat X1-9 1 1 4 1
## Porsche 914-2 0 1 5 2
## Lotus Europa 1 1 5 2
## Ford Pantera L 0 1 5 4
## Ferrari Dino 0 1 5 6
## Maserati Bora 0 1 5 8
## Volvo 142E 1 1 4 2
# Alternatively
D = mtcars
names(D)[c(1, 2, 4)] = c("mile per gallon", "cylinder", "horse power")
D
## mile per gallon cylinder disp horse power drat wt qsec
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60
## vs am gear carb
## Mazda RX4 0 1 4 4
## Mazda RX4 Wag 0 1 4 4
## Datsun 710 1 1 4 1
## Hornet 4 Drive 1 0 3 1
## Hornet Sportabout 0 0 3 2
## Valiant 1 0 3 1
## Duster 360 0 0 3 4
## Merc 240D 1 0 4 2
## Merc 230 1 0 4 2
## Merc 280 1 0 4 4
## Merc 280C 1 0 4 4
## Merc 450SE 0 0 3 3
## Merc 450SL 0 0 3 3
## Merc 450SLC 0 0 3 3
## Cadillac Fleetwood 0 0 3 4
## Lincoln Continental 0 0 3 4
## Chrysler Imperial 0 0 3 4
## Fiat 128 1 1 4 1
## Honda Civic 1 1 4 2
## Toyota Corolla 1 1 4 1
## Toyota Corona 1 0 3 1
## Dodge Challenger 0 0 3 2
## AMC Javelin 0 0 3 2
## Camaro Z28 0 0 3 4
## Pontiac Firebird 0 0 3 2
## Fiat X1-9 1 1 4 1
## Porsche 914-2 0 1 5 2
## Lotus Europa 1 1 5 2
## Ford Pantera L 0 1 5 4
## Ferrari Dino 0 1 5 6
## Maserati Bora 0 1 5 8
## Volvo 142E 1 1 4 2
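Column names that contain spaces are allowed, but they need extra care when you refer to them later. The sketch below shows the options in base R. The underscore names at the end are only a suggestion, not part of the original example.

```r
D <- mtcars
names(D)[c(1, 2, 4)] <- c("mile per gallon", "cylinder", "horse power")

mean(D$`mile per gallon`)        # backticks are needed with $ when a name has spaces
mean(D[["mile per gallon"]])     # double brackets take the name as an ordinary string
max(D[, "horse power"])          # 335

# Names without spaces avoid the backticks altogether
names(D)[c(1, 4)] <- c("miles_per_gallon", "horse_power")
mean(D$miles_per_gallon)
```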
# Subsetting a data frame by selecting some columns
library(dplyr)
select(mtcars, disp, wt, hp)
## disp wt hp
## Mazda RX4 160.0 2.620 110
## Mazda RX4 Wag 160.0 2.875 110
## Datsun 710 108.0 2.320 93
## Hornet 4 Drive 258.0 3.215 110
## Hornet Sportabout 360.0 3.440 175
## Valiant 225.0 3.460 105
## Duster 360 360.0 3.570 245
## Merc 240D 146.7 3.190 62
## Merc 230 140.8 3.150 95
## Merc 280 167.6 3.440 123
## Merc 280C 167.6 3.440 123
## Merc 450SE 275.8 4.070 180
## Merc 450SL 275.8 3.730 180
## Merc 450SLC 275.8 3.780 180
## Cadillac Fleetwood 472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial 440.0 5.345 230
## Fiat 128 78.7 2.200 66
## Honda Civic 75.7 1.615 52
## Toyota Corolla 71.1 1.835 65
## Toyota Corona 120.1 2.465 97
## Dodge Challenger 318.0 3.520 150
## AMC Javelin 304.0 3.435 150
## Camaro Z28 350.0 3.840 245
## Pontiac Firebird 400.0 3.845 175
## Fiat X1-9 79.0 1.935 66
## Porsche 914-2 120.3 2.140 91
## Lotus Europa 95.1 1.513 113
## Ford Pantera L 351.0 3.170 264
## Ferrari Dino 145.0 2.770 175
## Maserati Bora 301.0 3.570 335
## Volvo 142E 121.0 2.780 109
# Alternatively
mtcars[ , c(3, 6, 4)]
## disp wt hp
## Mazda RX4 160.0 2.620 110
## Mazda RX4 Wag 160.0 2.875 110
## Datsun 710 108.0 2.320 93
## Hornet 4 Drive 258.0 3.215 110
## Hornet Sportabout 360.0 3.440 175
## Valiant 225.0 3.460 105
## Duster 360 360.0 3.570 245
## Merc 240D 146.7 3.190 62
## Merc 230 140.8 3.150 95
## Merc 280 167.6 3.440 123
## Merc 280C 167.6 3.440 123
## Merc 450SE 275.8 4.070 180
## Merc 450SL 275.8 3.730 180
## Merc 450SLC 275.8 3.780 180
## Cadillac Fleetwood 472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial 440.0 5.345 230
## Fiat 128 78.7 2.200 66
## Honda Civic 75.7 1.615 52
## Toyota Corolla 71.1 1.835 65
## Toyota Corona 120.1 2.465 97
## Dodge Challenger 318.0 3.520 150
## AMC Javelin 304.0 3.435 150
## Camaro Z28 350.0 3.840 245
## Pontiac Firebird 400.0 3.845 175
## Fiat X1-9 79.0 1.935 66
## Porsche 914-2 120.3 2.140 91
## Lotus Europa 95.1 1.513 113
## Ford Pantera L 351.0 3.170 264
## Ferrari Dino 145.0 2.770 175
## Maserati Bora 301.0 3.570 335
## Volvo 142E 121.0 2.780 109
# or
mtcars[c(3, 6, 4)]
## disp wt hp
## Mazda RX4 160.0 2.620 110
## Mazda RX4 Wag 160.0 2.875 110
## Datsun 710 108.0 2.320 93
## Hornet 4 Drive 258.0 3.215 110
## Hornet Sportabout 360.0 3.440 175
## Valiant 225.0 3.460 105
## Duster 360 360.0 3.570 245
## Merc 240D 146.7 3.190 62
## Merc 230 140.8 3.150 95
## Merc 280 167.6 3.440 123
## Merc 280C 167.6 3.440 123
## Merc 450SE 275.8 4.070 180
## Merc 450SL 275.8 3.730 180
## Merc 450SLC 275.8 3.780 180
## Cadillac Fleetwood 472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial 440.0 5.345 230
## Fiat 128 78.7 2.200 66
## Honda Civic 75.7 1.615 52
## Toyota Corolla 71.1 1.835 65
## Toyota Corona 120.1 2.465 97
## Dodge Challenger 318.0 3.520 150
## AMC Javelin 304.0 3.435 150
## Camaro Z28 350.0 3.840 245
## Pontiac Firebird 400.0 3.845 175
## Fiat X1-9 79.0 1.935 66
## Porsche 914-2 120.3 2.140 91
## Lotus Europa 95.1 1.513 113
## Ford Pantera L 351.0 3.170 264
## Ferrari Dino 145.0 2.770 175
## Maserati Bora 301.0 3.570 335
## Volvo 142E 121.0 2.780 109
# or
mtcars[, c("disp", "wt", "hp")]
## disp wt hp
## Mazda RX4 160.0 2.620 110
## Mazda RX4 Wag 160.0 2.875 110
## Datsun 710 108.0 2.320 93
## Hornet 4 Drive 258.0 3.215 110
## Hornet Sportabout 360.0 3.440 175
## Valiant 225.0 3.460 105
## Duster 360 360.0 3.570 245
## Merc 240D 146.7 3.190 62
## Merc 230 140.8 3.150 95
## Merc 280 167.6 3.440 123
## Merc 280C 167.6 3.440 123
## Merc 450SE 275.8 4.070 180
## Merc 450SL 275.8 3.730 180
## Merc 450SLC 275.8 3.780 180
## Cadillac Fleetwood 472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial 440.0 5.345 230
## Fiat 128 78.7 2.200 66
## Honda Civic 75.7 1.615 52
## Toyota Corolla 71.1 1.835 65
## Toyota Corona 120.1 2.465 97
## Dodge Challenger 318.0 3.520 150
## AMC Javelin 304.0 3.435 150
## Camaro Z28 350.0 3.840 245
## Pontiac Firebird 400.0 3.845 175
## Fiat X1-9 79.0 1.935 66
## Porsche 914-2 120.3 2.140 91
## Lotus Europa 95.1 1.513 113
## Ford Pantera L 351.0 3.170 264
## Ferrari Dino 145.0 2.770 175
## Maserati Bora 301.0 3.570 335
## Volvo 142E 121.0 2.780 109
# or
mtcars[c("disp", "wt", "hp")]
## disp wt hp
## Mazda RX4 160.0 2.620 110
## Mazda RX4 Wag 160.0 2.875 110
## Datsun 710 108.0 2.320 93
## Hornet 4 Drive 258.0 3.215 110
## Hornet Sportabout 360.0 3.440 175
## Valiant 225.0 3.460 105
## Duster 360 360.0 3.570 245
## Merc 240D 146.7 3.190 62
## Merc 230 140.8 3.150 95
## Merc 280 167.6 3.440 123
## Merc 280C 167.6 3.440 123
## Merc 450SE 275.8 4.070 180
## Merc 450SL 275.8 3.730 180
## Merc 450SLC 275.8 3.780 180
## Cadillac Fleetwood 472.0 5.250 205
## Lincoln Continental 460.0 5.424 215
## Chrysler Imperial 440.0 5.345 230
## Fiat 128 78.7 2.200 66
## Honda Civic 75.7 1.615 52
## Toyota Corolla 71.1 1.835 65
## Toyota Corona 120.1 2.465 97
## Dodge Challenger 318.0 3.520 150
## AMC Javelin 304.0 3.435 150
## Camaro Z28 350.0 3.840 245
## Pontiac Firebird 400.0 3.845 175
## Fiat X1-9 79.0 1.935 66
## Porsche 914-2 120.3 2.140 91
## Lotus Europa 95.1 1.513 113
## Ford Pantera L 351.0 3.170 264
## Ferrari Dino 145.0 2.770 175
## Maserati Bora 301.0 3.570 335
## Volvo 142E 121.0 2.780 109
# We can also deselect some columns to form a subset
select(mtcars, -am, -carb)
## mpg cyl disp hp drat wt qsec vs gear
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 3
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 3
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 3
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 3
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 4
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 4
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 3
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 3
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 3
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 4
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 4
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 4
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 3
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 3
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 3
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 3
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 3
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 4
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 5
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 5
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 5
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 5
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 5
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 4
# Subsetting a data frame by selecting some rows meeting some conditions
D = subset(mtcars, cyl == 6 & gear %in% c(3, 5) & hp > 100)
D
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## Ferrari Dino 19.7 6 145 175 3.62 2.770 15.50 0 1 5 6
# or
D = mtcars[mtcars$cyl == 6 & mtcars$gear %in% c(3, 5) & mtcars$hp > 100, ]
D
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## Ferrari Dino 19.7 6 145 175 3.62 2.770 15.50 0 1 5 6
str(iris) # The data frame "iris" ships with R in the "datasets" package
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
plot(iris)
head(mpg) # Display the first 6 rows of data frame "mpg"
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
table(mpg$class) # Tabulate the class column in the data frame "mpg": a frequency table
##
## 2seater compact midsize minivan pickup subcompact suv
## 5 47 41 11 33 35 62
# Create a random sample from a finite population with known elements
population = c(12, 56, 87, 43, 56, 54, 82, 34, 61, 52, 84, 97, 37, 28, 39)
y = sample(x = population, size = 5, replace = FALSE) # Sampling 5 values from the population without replacement
y
## [1] 43 54 39 52 61
# Create a sample from a discrete population with a known distribution.
z = sample(1:3, size = 1000, prob = c(0.7, 0.15, 0.15), replace = TRUE) # The sampling will have to be w/ replacement
table(z)/1000*100 # Check the quality of the sample to see how close it is to the population
## z
## 1 2 3
## 68.2 16.7 15.1
# A question: How can we write a function that randomly chooses a given number of rows
# from an existing data frame to create a new data frame? This is called data partitioning.
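One possible answer to the question above, as a minimal sketch: draw random row indices and use them to subset the data frame. The helper name `sampleRows` and its arguments are hypothetical and do not come from any package.

```r
# Hypothetical helper: randomly choose k rows of a data frame D (without replacement)
sampleRows = function(D, k){
  stopifnot(is.data.frame(D), k >= 0, k <= nrow(D))
  idx = sample.int(nrow(D), size = k, replace = FALSE) # Random row indices
  D[idx, , drop = FALSE]                               # Keep the data frame structure
}

set.seed(1)                 # Fix the seed so the same rows are drawn on every run
S = sampleRows(mtcars, 5)
nrow(S)                     # 5 rows were drawn
```

Sampling the indices rather than the rows themselves keeps each row intact, so all columns of a chosen car stay together.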
# Group your data by one or more categorical variables and then summarize the grouped data
library(dplyr) # Load the package
mySummary = mtcars %>% group_by(vs, am) %>% summarise(n = n())
## `summarise()` regrouping output by 'vs' (override with `.groups` argument)
mySummary
## # A tibble: 4 x 3
## # Groups: vs [2]
## vs am n
## <dbl> <dbl> <int>
## 1 0 0 12
## 2 0 1 6
## 3 1 0 7
## 4 1 1 7
# Remove all the objects we created so far. This can be very useful!
rm(list = ls())
# Round values
x = c(100.45, 67.35, 78.82, 98.43, - 67.41, -84.92)
round(x, 1) # round to one decimal place
## [1] 100.4 67.3 78.8 98.4 -67.4 -84.9
round(x, 0) # round to the nearest whole number
## [1] 100 67 79 98 -67 -85
round(x, -1) # round to the nearest multiple of 10
## [1] 100 70 80 100 -70 -80
# Paste a few strings with a separator.
paste("Tomorrow is ", Sys.Date() + 1, ", the due date for ", "project #", 15, ". ", "Don't miss it!", sep = "")
## [1] "Tomorrow is 2021-02-02, the due date for project #15. Don't miss it!"
# The switch() function for conditional execution
x = c(45, 78, 93, 25, 54, 80)
stats = "Sd" # No case below is named "Sd" (names are case-sensitive), so the default runs
switch(stats,
Mean = mean(x),
SD = sd(x),
Median = median(x),
Summary = summary(x),
cat("Sorry, it goes beyond my capacity.")
)
## Sorry, it goes beyond my capacity.
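The default branch ran above because switch() matches names exactly, including case. The sketch below wraps the same idea in a hypothetical function `describe` and uses toupper() as one assumed way to make the choice case-insensitive.

```r
# Hypothetical wrapper around switch(); the branch names are upper case on purpose
describe = function(x, stats){
  switch(toupper(stats),                     # Normalize the case before matching
         MEAN = mean(x),
         SD = sd(x),
         MEDIAN = median(x),
         stop("Unknown statistic: ", stats)  # Default branch: no name matched
  )
}

x = c(45, 78, 93, 25, 54, 80)
describe(x, "Sd")       # Now matches the SD branch
describe(x, "median")   # 66
```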
# Handling dates
y = c(34, 56, 61, 78, 84, 92, 100, 120, 125)
x = c("1990-05-01", "1990-05-02", "1990-05-03", "1990-05-04", "1990-05-05", "1990-05-06", "1990-05-07", "1990-05-08", "1990-05-09")
dx = as.Date(x)
dx
## [1] "1990-05-01" "1990-05-02" "1990-05-03" "1990-05-04" "1990-05-05"
## [6] "1990-05-06" "1990-05-07" "1990-05-08" "1990-05-09"
class(x)
## [1] "character"
class(dx)
## [1] "Date"
plot(y~dx, xlab = "Date")
D = Sys.Date() # Get today's date
weekdays(D) # Extract the day of the week
## [1] "Monday"
months(D)
## [1] "February"
quarters(D)
## [1] "Q1"
julian(D) # Number of days since the origin (1970-01-01)
## [1] 18659
## attr(,"origin")
## [1] "1970-01-01"
julian(D, origin = as.Date("2000-07-01")) # Number of days since the origin (2000-07-01)
## [1] 7520
## attr(,"origin")
## [1] "2000-07-01"
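The julian() results above are plain date differences. A small sketch, assuming a fixed date (2021-02-01, chosen so that the result does not depend on when the code is run):

```r
D0 = as.Date("2021-02-01")               # A fixed date, assumed for illustration
as.numeric(D0 - as.Date("1970-01-01"))   # 18659 days since 1970-01-01
as.numeric(julian(D0))                   # The same number via julian()
as.numeric(D0 - as.Date("2000-07-01"))   # 7520 days since 2000-07-01
format(D0 + 30, "%d/%m/%Y")              # "03/03/2021": dates can be shifted and re-formatted
```

Subtracting two Date objects gives a "difftime" object; as.numeric() strips it down to the number of days.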
R users can also define their own functions. The structure of a user-defined function in R looks like the following:
# functionName = function(a list of parameters/arguments separated by comma){
# The function body with the last line being the returned value (can be any data structure)
# }
The different parts of a function are:
Function Name: This is the actual name of the function. It is stored in the R environment as an object with this name.
Arguments: An argument is a placeholder. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may have no arguments. Arguments can also have default values.
Function Body: The function body contains a collection of statements that defines what the function does.
Return Value: The return value of a function is the last expression in the function body to be evaluated.
(From https://www.tutorialspoint.com/r/r_functions.htm)
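The four parts listed above can be seen together in one small function. This is only an illustrative sketch; the name `circleArea` and its arguments are hypothetical.

```r
# Function name: circleArea; arguments: r and digits (digits has a default value)
circleArea = function(r, digits = 2){
  area = pi * r^2        # Function body
  round(area, digits)    # The last evaluated expression is the return value
}

circleArea(1)              # 3.14, using the default digits = 2
circleArea(1, digits = 4)  # 3.1416, overriding the default
```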
f1 = function(x){
2*x^2 - 3/x +1
}
# A better version that handles the problematic input x = 0
f2 = function(x){
if (x != 0) {
2*x^2 - 3/x +1
} else {
cat("Can't be done due to zero denominator!\n", "Please use a non-zero input.")
}
}
# A function for a simple summary of a numeric sample
mySummary = function(x){
Mean = mean(x)
Median = median(x)
Std = sd(x)
list(Mean = Mean, Median = Median, "Standard Deviation" = Std)
}
# A function that returns the elements of a vector in reverse order
rprint = function(x){
n = length(x)
v = NULL
for (i in seq_len(n)){ # seq_len(n) is safer than 1:n, which misbehaves when n is 0
v[i] = x[n-i+1]
}
v
}
rprint(1:9)
## [1] 9 8 7 6 5 4 3 2 1
rprint(letters)
## [1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h"
## [20] "g" "f" "e" "d" "c" "b" "a"
# There is a built-in function in R that gives the reversal
rev(1:9)
## [1] 9 8 7 6 5 4 3 2 1
rev(letters)
## [1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h"
## [20] "g" "f" "e" "d" "c" "b" "a"
# A user-defined function that partitions a data frame into training data, validation data, and test data
partition = function(D, prop = c(0.6, 0.2, 0.2)){ # D is a data frame to partition
n = nrow(D)
idx = sample(x = 1:n, size = n, replace = FALSE) # Shuffle the original rows of D
shuffled = D[idx, ]
n1 = round(n * prop[1])
n2 = round(n * prop[2])
n3 = n - n1 - n2 # Can I do n3 = round(n * prop[3])?
training = shuffled[1:n1, ]
validation = shuffled[(n1+1):(n1+n2), ]
test = shuffled[(n1 + n2 + 1):n, ]
L = list(training = training, validation = validation, test = test)
return(L)
}
partition(mtcars, c(0.7, 0.15, 0.15))
## $training
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
##
## $validation
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
##
## $test
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
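On the question in the comment inside partition(): rounding all three sizes independently is not safe, because the rounded sizes need not add up to n. A small arithmetic check, using an assumed n = 13 with the default proportions:

```r
n = 13                     # An assumed number of rows, for illustration only
prop = c(0.6, 0.2, 0.2)
sizes = round(n * prop)    # Round each part independently: 8, 3, 3
sum(sizes)                 # 14, one more than n

n1 = sizes[1]
n2 = sizes[2]
n3 = n - n1 - n2           # What partition() does: take the remainder, here 2
n1 + n2 + n3 == n          # TRUE
```

Taking the last size as the remainder guarantees that every row lands in exactly one of the three parts.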
The pipe operator, %>% , comes from the “magrittr” package. The point of the pipe is to help you write human-friendly code.
In mathematics, you can make a composite function by doing something like y = f(g(h(x))), which is equivalent to
\[x \rightarrow h() \rightarrow g() \rightarrow f() = y\]
The above process involves 4 steps:
Step 1: Input “x” into the function “h”.
Step 2: Input the result “h(x)” from step 1 into the function “g”.
Step 3: Input the result “g(h(x))” from step 2 into the function “f”.
Step 4: The output is “f(g(h(x)))” and assigned to “y”.
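The four steps above can be written out in plain R. The functions h, g, and f below are hypothetical and chosen only for illustration.

```r
h = function(x) x + 1   # Hypothetical inner function
g = function(x) 2 * x   # Hypothetical middle function
f = function(x) x^2     # Hypothetical outer function

x = 3
s1 = h(x)               # Step 1: h(x) = 4
s2 = g(s1)              # Step 2: g(h(x)) = 8
y = f(s2)               # Steps 3 and 4: f(g(h(x))) = 64, assigned to y
y == f(g(h(x)))         # TRUE: the stepwise and nested forms agree
```

The nested form has to be read from the inside out, while the stepwise form reads in the order in which the work is done.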
In the “magrittr” package, the right arrow $\rightarrow$ is represented by “%>%”. Here are some examples.
x = c( 23, 45, 34, 78, 12, 56)
mean(x)
## [1] 41.33333
# The following 3 lines of code are each equivalent to the previous line
x %>% mean() # that is, we can factor x out!
## [1] 41.33333
x %>% mean
## [1] 41.33333
x %>% mean(.) # "." is a placeholder
## [1] 41.33333
x %>% sqrt() %>% sum() # A chain of two calls, the same as sum(sqrt(x)): just for fun
## [1] 37.11416
# The following gives a more realistic example.
library(dplyr) # The count() function is from this package.
print(starwars) # A dataset from the package
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Dart… 202 136 none white yellow 41.9 male mascu…
## 5 Leia… 150 49 brown light brown 19 fema… femin…
## 6 Owen… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Bigg… 183 84 black light brown 24 male mascu…
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
# The column "species" is a categorical variable in the data "starwars".
# The following gets its distribution, which is discrete.
D1 <- starwars %>% count(species)
D1
## # A tibble: 38 x 2
## species n
## <chr> <int>
## 1 Aleena 1
## 2 Besalisk 1
## 3 Cerean 1
## 4 Chagrian 1
## 5 Clawdite 1
## 6 Droid 6
## 7 Dug 1
## 8 Ewok 1
## 9 Geonosian 1
## 10 Gungan 3
## # … with 28 more rows
# Plot the discrete distribution
barplot(height = D1$n, names = D1$species)
with(D1, barplot(height = n, names = species)) # Alternatively
# Sort the distribution table by the frequency (column "n")
D2 <- starwars %>% count(species, sort = TRUE)
D2
## # A tibble: 38 x 2
## species n
## <chr> <int>
## 1 Human 35
## 2 Droid 6
## 3 <NA> 4
## 4 Gungan 3
## 5 Kaminoan 2
## 6 Mirialan 2
## 7 Twi'lek 2
## 8 Wookiee 2
## 9 Zabrak 2
## 10 Aleena 1
## # … with 28 more rows
# Plot the discrete distribution
bp = barplot(height = D2$n, names = D2$species, ylim = c(0, max(D2$n)*1.1), las = 2) # A sorted barchart, called the Pareto chart
with(D2, barplot(height = n, names = species)) # Alternatively
# The following adds count labels above the bars (pos = 3), offset by 0.1 of a character width
text(bp, D2$n*0.9, labels = D2$n, pos = 3, offset = 0.1, col = "red")
title("Distribution of Species in Starwars", col.main = "blue", cex.main = 2, sub = "(data courtesy of xyz)")
# Another way to plot: just for illustration and not recommended
starwars %>% .$species %>% table() %>% sort(., decreasing = TRUE) %>% barplot(ylim = c(0, max(D2$n)*1.1), las = 2) %>% text(., D2$n, labels = D2$n, pos = 3, offset = 0.1, col = "red")
# Joint distribution of "sex" and "gender" and sort by the frequency (column "n")
D3 <- starwars %>% count(sex, gender, sort = TRUE)
D3
## # A tibble: 6 x 3
## sex gender n
## <chr> <chr> <int>
## 1 male masculine 60
## 2 female feminine 16
## 3 none masculine 5
## 4 <NA> <NA> 4
## 5 hermaphroditic masculine 1
## 6 none feminine 1
Your data may have missing values. In R, missing values are indicated by “NA”. Missing values can be removed or imputed, depending on the context. There is a large body of research on handling missing values.
x = c(2, 6, 9, NA, 10, 23, NA, 30)
print(x)
## [1] 2 6 9 NA 10 23 NA 30
mean(x) # Produces NA
## [1] NA
sd(x) # Produces NA
## [1] NA
summary(x) # NA's handled
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 6.75 9.50 13.33 19.75 30.00 2
# Handling missing values by simply removing them with the "na.rm" option.
mean(x, na.rm = TRUE)
## [1] 13.33333
sd(x, na.rm = TRUE)
## [1] 10.80123
# Remove the missing values to create a new vector
y = as.numeric(na.omit(x))
print(y)
## [1] 2 6 9 10 23 30
mean(y)
## [1] 13.33333
D1 = data.frame(x = c(1:10, NA, 12:15), y = c(2, 4, 5, 1, 0, NA, 7.2, NA, 10, 13.4, 15.2, NA, 18.5, 11, 20.5))
D2 = na.omit(D1) # Remove rows with a missing value
D1
## x y
## 1 1 2.0
## 2 2 4.0
## 3 3 5.0
## 4 4 1.0
## 5 5 0.0
## 6 6 NA
## 7 7 7.2
## 8 8 NA
## 9 9 10.0
## 10 10 13.4
## 11 NA 15.2
## 12 12 NA
## 13 13 18.5
## 14 14 11.0
## 15 15 20.5
D2
## x y
## 1 1 2.0
## 2 2 4.0
## 3 3 5.0
## 4 4 1.0
## 5 5 0.0
## 7 7 7.2
## 9 9 10.0
## 10 10 13.4
## 13 13 18.5
## 14 14 11.0
## 15 15 20.5
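The examples above remove missing values. Imputation, mentioned at the start of this section, fills them in instead. Below is a minimal sketch of mean imputation, which is one simple strategy among many; whether it is appropriate depends on the context.

```r
x = c(2, 6, 9, NA, 10, 23, NA, 30)
x.imputed = x
# Replace each NA with the mean of the observed (non-missing) values
x.imputed[is.na(x.imputed)] = mean(x, na.rm = TRUE)
x.imputed
anyNA(x.imputed)   # FALSE: no missing values remain
mean(x.imputed)    # Equal to mean(x, na.rm = TRUE)
```

Mean imputation keeps the sample mean unchanged but shrinks the spread of the data, so sd(x.imputed) is smaller than sd(x, na.rm = TRUE).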
In R, we can read data from files stored outside the R environment. We can also write data to files, which are then stored and accessed by the operating system. R can read from and write to various file formats, such as CSV and Excel.
If the data are on the local computer, it is convenient to store them in the working directory where most of your R files are kept.
You can check which directory the R workspace is pointing to using the “getwd” function. You can also set a new working directory using the “setwd” function.
getwd() # Get the working directory
## [1] "/Users/home/Documents/Zhang/Stat415.515.615"
#setwd("/Users/home/Documents/Zhang/Stat415.515.615") # Set it to a new one
#getwd()
# myData is a data frame made from the raw csv data
myData = read.csv("Sales.csv")
head(myData, n = 20) # Display only the first 20 rows of the data frame
## Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 1 FDW58 20.750 Low Fat 0.007564836
## 2 FDW14 8.300 reg 0.038427677
## 3 NCN55 14.600 Low Fat 0.099574908
## 4 FDQ58 7.315 Low Fat 0.015388393
## 5 FDY38 NA Regular 0.118599314
## 6 FDH56 9.800 Regular 0.063817206
## 7 FDL48 19.350 Regular 0.082601537
## 8 FDC48 NA Low Fat 0.015782495
## 9 FDN33 6.305 Regular 0.123365446
## 10 FDA36 5.985 Low Fat 0.005698435
## 11 FDT44 16.600 Low Fat 0.103569075
## 12 FDQ56 6.590 Low Fat 0.105811470
## 13 NCC54 NA Low Fat 0.171079215
## 14 FDU11 4.785 Low Fat 0.092737611
## 15 DRL59 16.750 LF 0.021206464
## 16 FDM24 6.135 Regular 0.079450700
## 17 FDI57 19.850 Low Fat 0.054135210
## 18 DRC12 17.850 Low Fat 0.037980963
## 19 NCM42 NA Low Fat 0.028184344
## 20 FDA46 13.600 Low Fat 0.196897637
## Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year
## 1 Snack Foods 107.8622 OUT049 1999
## 2 Dairy 87.3198 OUT017 2007
## 3 Others 241.7538 OUT010 1998
## 4 Snack Foods 155.0340 OUT017 2007
## 5 Dairy 234.2300 OUT027 1985
## 6 Fruits and Vegetables 117.1492 OUT046 1997
## 7 Baking Goods 50.1034 OUT018 2009
## 8 Baking Goods 81.0592 OUT027 1985
## 9 Snack Foods 95.7436 OUT045 2002
## 10 Baking Goods 186.8924 OUT017 2007
## 11 Fruits and Vegetables 118.3466 OUT017 2007
## 12 Fruits and Vegetables 85.3908 OUT045 2002
## 13 Health and Hygiene 240.4196 OUT019 1985
## 14 Breads 122.3098 OUT049 1999
## 15 Hard Drinks 52.0298 OUT013 1987
## 16 Baking Goods 151.6366 OUT049 1999
## 17 Seafood 198.7768 OUT045 2002
## 18 Soft Drinks 192.2188 OUT018 2009
## 19 Household 109.6912 OUT027 1985
## 20 Snack Foods 193.7136 OUT010 1998
## Outlet_Size Outlet_Location_Type Outlet_Type
## 1 Medium Tier 1 Supermarket Type1
## 2 Tier 2 Supermarket Type1
## 3 Tier 3 Grocery Store
## 4 Tier 2 Supermarket Type1
## 5 Medium Tier 3 Supermarket Type3
## 6 Small Tier 1 Supermarket Type1
## 7 Medium Tier 3 Supermarket Type2
## 8 Medium Tier 3 Supermarket Type3
## 9 Tier 2 Supermarket Type1
## 10 Tier 2 Supermarket Type1
## 11 Tier 2 Supermarket Type1
## 12 Tier 2 Supermarket Type1
## 13 Small Tier 1 Grocery Store
## 14 Medium Tier 1 Supermarket Type1
## 15 High Tier 3 Supermarket Type1
## 16 Medium Tier 1 Supermarket Type1
## 17 Tier 2 Supermarket Type1
## 18 Medium Tier 3 Supermarket Type2
## 19 Medium Tier 3 Supermarket Type3
## 20 Tier 3 Grocery Store
# The following reads Facebook stock prices from Yahoo Finance remotely
url = "https://query1.finance.yahoo.com/v7/finance/download/FB?period1=1577340844&period2=1608963244&interval=1d&events=history&includeAdjustedClose=true"
FB = read.csv(file = url)
head(FB, n = 15)
## Date Open High Low Close Adj.Close Volume
## 1 2019-12-26 205.57 207.82 205.31 207.79 207.79 9350700
## 2 2019-12-27 208.67 208.93 206.59 208.10 208.10 10284200
## 3 2019-12-30 207.86 207.90 203.90 204.41 204.41 10524300
## 4 2019-12-31 204.00 205.56 203.60 205.25 205.25 8953500
## 5 2020-01-02 206.75 209.79 206.27 209.78 209.78 12077100
## 6 2020-01-03 207.21 210.40 206.95 208.67 208.67 11188400
## 7 2020-01-06 206.70 212.78 206.52 212.60 212.60 17058900
## 8 2020-01-07 212.82 214.58 211.75 213.06 213.06 14912400
## 9 2020-01-08 213.00 216.24 212.61 215.22 215.22 13475000
## 10 2020-01-09 217.54 218.38 216.28 218.30 218.30 12642800
## 11 2020-01-10 219.20 219.88 217.42 218.06 218.06 12119400
## 12 2020-01-13 219.60 221.97 219.21 221.91 221.91 14463400
## 13 2020-01-14 221.61 222.38 218.63 219.06 219.06 13288900
## 14 2020-01-15 220.61 221.68 220.14 221.15 221.15 10036500
## 15 2020-01-16 222.57 222.63 220.39 221.77 221.77 10015300
# The following uses the fread() function to read the same Facebook data. f = fast and friendly
library(data.table) # Load the package first
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
FB2 = fread(input = url)
head(FB2, n = 15)
## Date Open High Low Close Adj Close Volume
## 1: 2019-12-26 205.57 207.82 205.31 207.79 207.79 9350700
## 2: 2019-12-27 208.67 208.93 206.59 208.10 208.10 10284200
## 3: 2019-12-30 207.86 207.90 203.90 204.41 204.41 10524300
## 4: 2019-12-31 204.00 205.56 203.60 205.25 205.25 8953500
## 5: 2020-01-02 206.75 209.79 206.27 209.78 209.78 12077100
## 6: 2020-01-03 207.21 210.40 206.95 208.67 208.67 11188400
## 7: 2020-01-06 206.70 212.78 206.52 212.60 212.60 17058900
## 8: 2020-01-07 212.82 214.58 211.75 213.06 213.06 14912400
## 9: 2020-01-08 213.00 216.24 212.61 215.22 215.22 13475000
## 10: 2020-01-09 217.54 218.38 216.28 218.30 218.30 12642800
## 11: 2020-01-10 219.20 219.88 217.42 218.06 218.06 12119400
## 12: 2020-01-13 219.60 221.97 219.21 221.91 221.91 14463400
## 13: 2020-01-14 221.61 222.38 218.63 219.06 219.06 13288900
## 14: 2020-01-15 220.61 221.68 220.14 221.15 221.15 10036500
## 15: 2020-01-16 222.57 222.63 220.39 221.77 221.77 10015300
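Writing data to a file works much like reading. The sketch below writes a small data frame to a CSV file and reads it back. The data frame `out` is made up for illustration, and tempfile() is used only so that nothing in the working directory is touched.

```r
out = data.frame(id = 1:3, price = c(207.79, 208.10, 204.41)) # Illustrative data
path = tempfile(fileext = ".csv")               # A throwaway file location
write.csv(out, file = path, row.names = FALSE)  # row.names = FALSE avoids an extra index column
back = read.csv(path)
all.equal(out, back)                            # TRUE if the round trip preserved the data
unlink(path)                                    # Delete the temporary file
```

To keep the file, replace `path` with a file name such as "myPrices.csv" (a hypothetical name), which is then created in the working directory reported by getwd().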
A cheatsheet of ggplot2: https://rstudio.com/wp-content/uploads/2015/04/ggplot2-cheatsheet.pdf
You will need to practice the examples on your own in order to save class time for other topics!
The structure of the code that creates a ggplot object in R is:
ggplot(data = ?, mapping = aes(x = ?, y = ?, color = ?, fill = ?, …)) + geom_*()
where * can be “line”, “point”, “bar”, “boxplot”, “density”, “dotplot”, “histogram”, “hline”, “vline”, “segment”, “text”, “smooth”, …
y = c(34, 56, 61, 78, 84, 92, 100, 120, 125)
x = c("1990-05-01", "1990-05-02", "1990-05-03", "1990-05-04", "1990-05-05", "1990-05-06", "1990-05-07", "1990-05-08", "1990-05-09")
dx = as.Date(x)
df = data.frame(x, y, dx)
library(ggplot2)
# Line plots
ggplot(data = df, mapping = aes(x=dx, y=y)) +
geom_line(size = 3, color = "yellow", linetype = "solid") +
scale_y_continuous(limits = c(40, 150)) + # Restrict the y-scale
labs(title = "Some Line Plot", x = "Date", y = "Price", subtitle = "A First Example on ggplot", caption = "Made by XYZ") +
theme(plot.title = element_text(hjust = 0.5, size = 20, color = "red", face = "bold"),
plot.subtitle = element_text(hjust = 0.5, size = 10, color = "blue", face = "italic"),
plot.caption = element_text(hjust = 1, size = 8, color = "green"),
axis.text.x=element_text(size=12, color = "red"),
axis.text.y=element_text(size=25, color = "pink", face = "bold"),
axis.title.x=element_text(size=14,face="bold"),
axis.title.y=element_text(size=30,face="bold"),
plot.background = element_rect(fill = "lightyellow",
colour = "lightblue",
size = 0.5, linetype = "dashed"),
panel.background = element_rect(fill = "lightgray",
colour = "lightblue",
size = 0.5, linetype = "solid"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid',
colour = "purple"),
panel.grid.minor = element_line(size = 0.25, linetype = 'dotted',
colour = "violet")
)
## Warning: Removed 1 row(s) containing missing values (geom_path).
# Barplots
ggplot(data = iris, mapping = aes(x = Species, fill = Species)) +
geom_bar() +
theme(legend.position = "none")
# Density plot
ggplot(iris, aes(x = Sepal.Length, color = Species)) +
geom_density() +
labs(color = "Species") +
coord_flip()
ggplot(mtcars, aes(x = mpg, color = factor(cyl))) +
geom_density() +
labs(color = "Cyl") +
coord_flip()
# Violin plot
ggplot(iris, aes(y = Sepal.Length, x = Species)) +
geom_violin() +
labs(y = "Sepal Length") # Sepal.Length is mapped to y, so the label belongs on the y-axis
titanic = as.data.frame(Titanic)
ggplot(titanic, aes(x = Survived, y = Freq, fill = Class)) +
geom_col(position = "dodge") # For aggregated (grouped) data, or geom_bar(stat = "identity")
ggplot(titanic, aes(x = Survived, y = Freq, fill = Class)) +
geom_col(position = "stack") +
labs(fill = "Classes")
# Faceting
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_grid(cyl ~ .)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_grid(. ~ cyl) # The "." on the left side can be omitted
# When there are many categories for the categorical variable
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_wrap(. ~ manufacturer)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_grid(cyl ~ class)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_wrap(cyl ~ class)
# Making your plot interactive using the "plotly" package
# First store the ggplot object in a variable, called p here.
p = ggplot(mpg, aes(displ, hwy)) +
geom_point()
library(plotly) # Must load the package first
##
## Attaching package: 'plotly'
## The following object is masked from 'package:igraph':
##
## groups
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ggplotly(p)
# Better scatterplots using the "GGally" package
plot(mtcars) # scatterplot matrix
cor(mtcars) # correlation matrix
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
# Two in one
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
mtcars %>% GGally::ggpairs()
A choropleth map (in Greek ‘choro’ means ‘area/region’ and ‘pletho’ means ‘multitude’) is a type of thematic map in which a set of pre-defined areas is colored or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area, such as population density or per-capita income. (From Wiki)
The following code comes from R package documentation.
library(dplyr)
library(highcharter)
data("USArrests", package = "datasets")
data("usgeojson")
USArrests$state <- rownames(USArrests)
# Alternatively, USArrests <- mutate(USArrests, state = rownames(USArrests))
highchart() %>%
hc_title(text = "Violent Crime Rates by US State") %>%
hc_subtitle(text = "Source: USArrests data") %>%
hc_add_series_map(
map = usgeojson,
df = USArrests,
name = "Murder arrests (per 100,000)",
value = "Murder", joinBy = c("woename", "state"),
dataLabels = list(
enabled = TRUE,
format = "{point.properties.postalcode}"
)
) %>%
hc_colorAxis(stops = color_stops()) %>%
hc_legend(valueDecimals = 0, valueSuffix = "%") %>%
hc_mapNavigation(enabled = TRUE)
Refer to the page https://www.datanovia.com/en/lessons/highchart-interactive-world-map-in-r/ and use the code to practice.
# Load required R packages
#library(tidyverse)
# library(highcharter)
# Retrieve life expectancy data for the year 2019
library(dplyr)
life.exp <- read.csv("/Users/home/Documents/Zhang/Stat415.515.615/lifeExpectancy.csv")
life.exp <- life.exp %>%
filter(Year == 2019)
head(life.exp)
## Entity Code Year Life.expectancy
## 1 Afghanistan AFG 2019 64.83300
## 2 Africa 2019 63.17000
## 3 Albania ALB 2019 78.57300
## 4 Algeria DZA 2019 76.88000
## 5 American Samoa ASM 2019 73.74500
## 6 Americas 2019 76.83539
# Load the world Map data
data(worldgeojson, package = "highcharter")
hc <- highchart() %>%
hc_add_series_map(
worldgeojson, life.exp, value = "Life.expectancy", joinBy = c('iso3', "Code"),
name = "LifeExpectancy"
) %>%
hc_colorAxis(stops = color_stops()) %>%
hc_title(text = "World Map") %>%
hc_subtitle(text = "Life Expectancy in 2019")
hc
A heat map (or heatmap) is a data visualization technique that shows magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space. (From Wiki)
Refer to: https://www.r-graph-gallery.com/215-the-heatmap-function.html
# The mtcars dataset:
data <- as.matrix(mtcars) # The heatmap function takes as input a matrix.
# Default Heatmap
heatmap(data)
The heatmap just generated is not very insightful: all the variation is absorbed by the “hp” and “disp” variables that have very high values compared to the others. We need to normalize the data.
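One quick way to see the imbalance is to compare the largest value in each column. This small check is added here for illustration and is not part of the referenced gallery page; the name col_max is hypothetical.

```r
# Largest value in each mtcars column, sorted from largest to smallest
col_max <- sapply(mtcars, max)
sort(col_max, decreasing = TRUE)
```

The two columns at the top of the sorted result are the ones that dominate the unscaled color range.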
Normalization is done with the scale argument of the heatmap() function, which can be applied by row or by column. Here the column option is chosen, since the variables (columns) are measured on very different scales.
# Use 'scale' to normalize
heatmap(data, scale="column")
# No dendrogram or reordering for either columns or rows
heatmap(data, Colv = NA, Rowv = NA, scale="column")
Note: to standardize a dataset by column outside of heatmap(), use R's scale() function, as shown below.
scaled.mtcars = scale(mtcars)
# Check
apply(scaled.mtcars, 2, mean) # All column means are now basically zero
## mpg cyl disp hp drat
## 7.112366e-17 -1.474515e-17 -9.084937e-17 1.040834e-17 -2.918672e-16
## wt qsec vs am gear
## 4.681043e-17 5.299580e-16 6.938894e-18 4.510281e-17 -3.469447e-18
## carb
## 3.165870e-17
apply(scaled.mtcars, 2, sd) # All column standard deviations are now one
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 1 1 1 1 1 1 1 1 1 1
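To make explicit what scale() computes by default, the sketch below standardizes each column by hand (subtract the column mean, then divide by the column standard deviation) and compares the result with scale(). The names m, centered, z, and same are hypothetical.

```r
m <- as.matrix(mtcars)

# Center each column at its mean, then divide by its standard deviation
centered <- sweep(m, 2, colMeans(m), "-")
z <- sweep(centered, 2, apply(m, 2, sd), "/")

# Compare with scale(); scale() attaches extra attributes, so ignore them
same <- all.equal(z, scale(mtcars), check.attributes = FALSE)
same
```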
For a correlation matrix:
# No dendrogram at all
# Correlations already share a common scale, so turn off heatmap()'s default row scaling
heatmap(cor(mtcars), Colv = NA, Rowv = NA, scale = "none")
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
heatmap.2(
  cor(mtcars), Colv = FALSE, Rowv = FALSE,
  cellnote = round(cor(mtcars), 2), notecol = "black",
  dendrogram = "none", trace = "none", key = FALSE
)
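One reason standardizing the variables is unnecessary for a correlation heatmap is that correlation itself is unaffected by centering and rescaling the columns. The short check below illustrates this; the name same_cor is hypothetical.

```r
# Correlation is unchanged when the columns are standardized first
same_cor <- all.equal(cor(mtcars), cor(scale(mtcars)))
same_cor
```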
For missing values,
w = c(NA, 7, NA, 34, 12, NA, 44, 21, 26, 56, NA, 45, 34, 12)
x = c( 9, 12, 23, 31, NA, 24, 13, 26, NA, 43, NA, NA, 34, NA)
y = c(32, 11, 7, NA, 8, 2, 3, NA, 2, 8, 12, 21, 54, NA)
z = c(41, 23, NA, 51, 52, 43, NA, 31, NA, 34, 31, NA, 33, NA)
df = data.frame(w, x,y,z)
# What is the difference between the results of the following 2 lines of code?
is.na(x)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
## [13] FALSE TRUE
is.na(x)*1
## [1] 0 0 0 0 1 0 0 0 1 0 1 1 0 1
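# Generate a heatmap that displays missing values
# scale = "none" keeps the 0/1 indicators as they are (heatmap() scales rows by default)
heatmap(is.na(df) * 1, Colv = NA, Rowv = NA, scale = "none")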
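As a companion to the question posed in the code above, the sketch below rebuilds the same data frame and shows that is.na() returns logical values, that multiplying by 1 converts them to the numeric 0/1 matrix that heatmap() requires, and that colSums() gives the number of missing values per column. The names na_flags, na_num, and na_counts are hypothetical.

```r
w <- c(NA, 7, NA, 34, 12, NA, 44, 21, 26, 56, NA, 45, 34, 12)
x <- c( 9, 12, 23, 31, NA, 24, 13, 26, NA, 43, NA, NA, 34, NA)
y <- c(32, 11,  7, NA,  8,  2,  3, NA,  2,  8, 12, 21, 54, NA)
z <- c(41, 23, NA, 51, 52, 43, NA, 31, NA, 34, 31, NA, 33, NA)
df <- data.frame(w, x, y, z)

na_flags <- is.na(df)     # logical matrix: TRUE where a value is missing
na_num <- na_flags * 1    # numeric matrix of 0s and 1s

is.logical(na_flags)      # TRUE
is.numeric(na_num)        # TRUE

# Number of missing values in each column
na_counts <- colSums(na_flags)
na_counts
```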
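A treemap is a space-filling visualization of hierarchical structures. The map is a set of nested rectangles: each group is represented by a rectangle whose area is proportional to the group's value, and its subgroups are drawn as smaller rectangles nested inside it.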
Here is a blog about treemaps: https://www.r-bloggers.com/2018/09/simple-steps-to-create-treemap-in-r/
# This code is from the R documentation.
library(treemap)
data(GNI2014)
treemap(
  dtf = GNI2014,
  index = c("continent", "iso3"),
  vSize = "population",
  vColor = "GNI",
  type = "value",
  format.legend = list(scientific = FALSE, big.mark = " ")
)
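In a treemap, the area of a top-level rectangle reflects the vSize variable summed over the items in that group. The sketch below uses a small made-up data frame (the values are hypothetical and not taken from GNI2014) to compute the share of the total area that each group's rectangle would receive; it uses base R only and does not call treemap().

```r
# Hypothetical toy data: two groups with three items in total
toy <- data.frame(
  group = c("A", "A", "B"),
  item  = c("a1", "a2", "b1"),
  size  = c(30, 10, 60)
)

# Total size per top-level group
group_size <- tapply(toy$size, toy$group, sum)

# Share of the total area each group's rectangle would occupy
area_share <- group_size / sum(group_size)
area_share
```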