Load libraries with either library() or require().
String construction.
dog <- "Chester"
print(paste("you are a dog", dog))
## [1] "you are a dog Chester"
nchar(dog)
## [1] 7
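A few other base R string helpers follow the same pattern as paste(). The sketch below is illustrative only; the separator and format strings are arbitrary choices, not part of the original notes.

```r
dog <- "Chester"

# paste() joins its arguments with a space by default.
greeting <- paste("you are a dog", dog)

# paste0() joins with no separator; sep = sets a custom one.
tag <- paste0("dog_", dog)          # "dog_Chester"
dashed <- paste("a", "b", sep = "-") # "a-b"

# sprintf() fills placeholders: %s for strings, %d for integers.
msg <- sprintf("%s has %d letters", dog, nchar(dog))
print(msg)  # "Chester has 7 letters"
```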
Create a vector with the combine function c(). Reference vector elements by position in brackets, or by element name. R compares vectors element-wise. If you compare a vector to a single value, R recycles that value to match the length of the vector.
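As a small sketch of element-wise comparison and recycling, the vectors a and b below are made-up values chosen only for illustration.

```r
a <- c(1, 10, 49)
b <- c(5, 10, 20)

# Two vectors of equal length are compared position by position.
a_gt_b <- a > b      # FALSE FALSE TRUE

# A single value is recycled across every element of the vector.
big <- a > 5         # FALSE TRUE TRUE

# The logical result can be used directly as a subset index.
a[big]               # 10 49
```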
There are two types of vectors in R: atomic vectors and lists. Atomic vectors are homogeneous, holding one of six types: logical, integer, double, character, complex, and raw (don’t worry about the relatively uncommon complex and raw types). Lists are recursive vectors (they can contain other lists).
Vectors have two key properties: type, returned by typeof(), and length, returned by length(). Subset a list with single brackets and extract elements with double brackets. For example,
a <- list(
a = 1:3,
b = "a string",
c = pi,
d = list(-1, -5)
)
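# Single brackets return a sub-list: a list of length one that holds d.
typeof(a[4])
## [1] "list"
# Double brackets extract the element itself: the list d with its two elements.
typeof(a[[4]])
## [1] "list"
# Single brackets on d again return a sub-list, here holding the first element of d.
typeof(a[[4]][1])
## [1] "list"
# Double brackets extract the first value of d.
typeof(a[[4]][[1]])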
## [1] "double"
numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")
boolean_vector <- c(TRUE, FALSE, TRUE)
character_vector[1]
## [1] "a"
boolean_vector[c(2,3)]
## [1] FALSE TRUE
boolean_vector[2:3]
## [1] FALSE TRUE
roulette_vector <- c(-24, -50, 100, -350, 10)
names(roulette_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
roulette_vector[1]
## Monday
## -24
roulette_vector["Monday"]
## Monday
## -24
# vector operations
sum(roulette_vector)
## [1] -314
mean(roulette_vector)
## [1] -62.8
# take a subset of a vector using booleans
roulette_vector[roulette_vector>0]
## Wednesday Friday
## 100 10
A matrix is a two-dimensional collection of elements. Create a matrix with the matrix(data, nrow, ncol, byrow)
function. Label the rows with rownames()
and the columns with colnames()
. Sum each row and column into vectors with rowSums()
and colSums()
. Bind rows and columns to a matrix with rbind()
and cbind()
. Reference matrix items with brackets [row, col].
# Matrix of numbers 1:20, filling one row at a time, for 5 rows and 4 columns. Specifying the number of columns is optional if number of rows is specified.
m <- matrix(1:20, byrow = TRUE, nrow = 5, ncol = 4)
rownames(m) <- c("row 1", "row 2", "row 3", "row 4", "row 5")
colnames(m) <- c("Col 1", "col 2", "col 3", "col 4")
m
## Col 1 col 2 col 3 col 4
## row 1 1 2 3 4
## row 2 5 6 7 8
## row 3 9 10 11 12
## row 4 13 14 15 16
## row 5 17 18 19 20
# Bind row sums to matrix.
m.rowSum <- rowSums(m)
cbind(m, m.rowSum)
## Col 1 col 2 col 3 col 4 m.rowSum
## row 1 1 2 3 4 10
## row 2 5 6 7 8 26
## row 3 9 10 11 12 42
## row 4 13 14 15 16 58
## row 5 17 18 19 20 74
# All rows of the second column of m.
m[,2]
## row 1 row 2 row 3 row 4 row 5
## 2 6 10 14 18
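The counterpart to binding row sums as a column is binding column sums as a row. This sketch rebuilds the same 5 x 4 matrix without labels; the row name Total is an illustrative choice.

```r
m <- matrix(1:20, byrow = TRUE, nrow = 5)

# Sum each column into a vector.
col_totals <- colSums(m)      # 45 50 55 60

# Bind the totals to the matrix as a new, named row.
m_with_totals <- rbind(m, Total = col_totals)

dim(m_with_totals)            # 6 4
m_with_totals["Total", ]      # reference the new row by name
```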
Use nrow() and ncol() to determine the number of rows and columns.
for (i in 1:nrow(m)) {
for (j in 1:ncol(m)) {
print(paste("On row ", i, " and column ", j, " the matrix contains ", m[i,j]))
}
}
## [1] "On row 1 and column 1 the matrix contains 1"
## [1] "On row 1 and column 2 the matrix contains 2"
## [1] "On row 1 and column 3 the matrix contains 3"
## [1] "On row 1 and column 4 the matrix contains 4"
## [1] "On row 2 and column 1 the matrix contains 5"
## [1] "On row 2 and column 2 the matrix contains 6"
## [1] "On row 2 and column 3 the matrix contains 7"
## [1] "On row 2 and column 4 the matrix contains 8"
## [1] "On row 3 and column 1 the matrix contains 9"
## [1] "On row 3 and column 2 the matrix contains 10"
## [1] "On row 3 and column 3 the matrix contains 11"
## [1] "On row 3 and column 4 the matrix contains 12"
## [1] "On row 4 and column 1 the matrix contains 13"
## [1] "On row 4 and column 2 the matrix contains 14"
## [1] "On row 4 and column 3 the matrix contains 15"
## [1] "On row 4 and column 4 the matrix contains 16"
## [1] "On row 5 and column 1 the matrix contains 17"
## [1] "On row 5 and column 2 the matrix contains 18"
## [1] "On row 5 and column 3 the matrix contains 19"
## [1] "On row 5 and column 4 the matrix contains 20"
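The factor() function converts a variable into type factor. R needs to know whether a variable is continuous or categorical. To specify an ordinal categorical variable, set ordered = TRUE (the abbreviation order = TRUE also works through partial argument matching) and list the levels from lowest to highest.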
student_status <- c("student", "not student", "student", "not student")
categorical_student <- factor(student_status)
categorical_student
## [1] student not student student not student
## Levels: not student student
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c("Low", "Medium", "High"))
temperature_vector
## [1] "High" "Low" "High" "Low" "Medium"
# Character strings compare alphabetically, which is not meaningful here; ordered factors compare by level.
temperature_vector[1] > temperature_vector[2]
## [1] FALSE
factor_temperature_vector[1] > factor_temperature_vector[2]
## [1] TRUE
# Change the level names with the levels function. Note the levels are initially in alphabetical order.
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")
# Notice how summary treats a factor variable differently from a character vector.
summary(survey_vector)
## Length Class Mode
## 5 character character
summary(factor_survey_vector)
## Female Male
## 2 3
A data frame is like a matrix, except each column can be a different data type. Several functions inspect data frames.
* head() (tail()): by default prints the first (last) 6 rows of the data frame.
* str(): prints the structure of the data frame. Probably the first function you’ll call with a new data set.
* dim(): prints the dimensions of the data frame.
* colnames(): prints the column names of the data frame.
* na.omit(): removes rows with NA in any column.
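Since mtcars has no missing values, na.omit() is easier to see on a small made-up data frame. The column names and values below are hypothetical.

```r
# A toy data frame with one NA in each of two different rows.
df <- data.frame(
  id    = 1:4,
  score = c(90, NA, 75, 60),
  grade = c("A", "B", NA, "D")
)

dim(df)              # 4 3

# na.omit() drops every row that has an NA in any column.
clean <- na.omit(df)

dim(clean)           # 2 3
clean$id             # 1 4
```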
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
head(mtcars,6)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
colnames(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
Create a data frame with the data.frame()
function.
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets_df <- data.frame(planets, type, diameter, rotation, rings)
# Select first 5 values of diameter column. The $ is a short-cut method.
planets_df[1:5,"diameter"]
## [1] 0.382 0.949 1.000 0.532 11.209
planets_df$diameter[1:5]
## [1] 0.382 0.949 1.000 0.532 11.209
Use subset() to apply a where condition to the data frame rows. Use order() to apply an order by to the data frame.
subset(planets_df, subset = diameter < 1)
## planets type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
planets_df[order(planets_df$diameter),]
## planets type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
Construct a list of objects with list(). Name the list items either with = at creation, or afterwards using names().
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]
my_list <- list(my_vector, my_matrix, my_df)
names(my_list) <- c("vec", "mat", "df")
my_list
## $vec
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $mat
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## $df
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# or
my_list2 <- list(vec = my_vector, mat = my_matrix, df = my_df)
my_list2
## $vec
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $mat
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## $df
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Reference an item in a list by its component number in double brackets, by its name in double brackets, or by its name after a dollar sign.
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
# Third col of second element of my_list (my_matrix)
my_list[[2]][,3]
## [1] 7 8 9
my_list$mat[,3]
## [1] 7 8 9
Append to a list with combine c()
.
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
my_list <- c(my_list, df2 = my_df)
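Relational operators include ==, !=, <, >, <=, and >=. Logical operators are &, |, and !. Be careful not to use && or || on vectors: they are meant for single values and do not compare element-wise (older versions of R evaluated only the first element; R 4.3.0 and later raise an error). The main conditional construct is if(), with optional else if and else branches.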
x <- 3
if (x %% 2 == 0) {
print("x is divisible by 2")
} else if (x %% 3 == 0) {
print("x is divisible by 3")
} else {
print("x is divisible by neither 2 nor 3")
}
## [1] "x is divisible by 3"
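The difference between the element-wise and the single-value logical operators can be sketched as follows. The vector x is an arbitrary example.

```r
x <- c(3, 8, 12)

# & , | and ! work element-wise and return one result per element.
between <- x > 5 & x < 10    # FALSE TRUE FALSE
outside <- x < 5 | x > 10    # TRUE FALSE TRUE
not_big <- !(x > 5)          # TRUE FALSE FALSE

# && and || expect single values, which suits an if() condition.
if (length(x) > 0 && x[1] == 3) {
  print("x is non-empty and starts with 3")
}
```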
The while loop is while (condition) {}. Break out of the loop early with if (condition) { break }.
i <- 1
while (i <= 10) {
print(3 * i)
# Parentheses make the intent explicit, because %% binds tighter than *.
if ((3 * i) %% 8 == 0) {
break
}
i <- i + 1
}
## [1] 3
## [1] 6
## [1] 9
## [1] 12
## [1] 15
## [1] 18
## [1] 21
## [1] 24
The for loop is for (var in seq) {expr}. The break statement abandons the active loop. The next statement skips the rest of the statements in the current loop iteration.
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
# Loop version 1
for(views in linkedin) {
print(views)
if (views > 10) {
break
} else if (views < 5) {
next
}
}
## [1] 16
# Loop version 2
for(i in 1:length(linkedin)) {
print(linkedin[i])
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14
# seq_along handles zero-length vectors and lists.
for (i in seq_along(linkedin)) {
print(linkedin[i])
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14
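A compact sketch that exercises both next and break in one loop. The thresholds 10 and 16 are arbitrary and chosen only so that both statements fire.

```r
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
shown <- c()

for (views in linkedin) {
  if (views < 10) next     # skip this iteration, continue with the next element
  if (views > 16) break    # abandon the loop entirely
  shown <- c(shown, views)
}

shown  # 16 13
```

The loop stops at 17, so the final value 14 is never visited.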
Get help on a function with help() or ?, or list its arguments with args(). Specify function arguments either by name or by position. Arguments that the documentation lists with a default value are optional.
#help(mean)
#?mean
args(mean)
## function (x, ...)
## NULL
grades <- c(8.5, 7, 9, 5.5, 6)
mean(x=grades)
## [1] 7.2
mean(grades)
## [1] 7.2
Define a custom function with function(). The return() statement returns a value and exits immediately; it is optional, because a function returns the value of its last evaluated expression. Set a default argument value with =.
multiply_a_b <- function(a, b = 1) {
return (a * b)
}
result <- multiply_a_b(a = 3, b = 7)
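To illustrate the default argument and the optional return(), here is the same function written without return() and called three ways.

```r
multiply_a_b <- function(a, b = 1) {
  a * b   # the last evaluated expression is the return value
}

by_default  <- multiply_a_b(3)             # b falls back to 1, giving 3
by_position <- multiply_a_b(3, 7)          # 21
by_name     <- multiply_a_b(b = 7, a = 3)  # named arguments can come in any order, 21
```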
Install a package with install.packages(). Packages are hosted on the Comprehensive R Archive Network (CRAN). List the currently attached packages with search(). R attaches seven packages to its search list by default. Attach more packages with library() or require().
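One practical difference between library() and require() is how they react to a missing package. The sketch below uses a deliberately non-existent package name, notARealPackage123, which is an assumption made purely for the demonstration.

```r
# search() lists the environments and packages currently attached.
attached <- search()
head(attached)

# require() returns FALSE (with a warning) when the package is missing.
ok <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
ok  # FALSE

# library() stops with an error instead, caught here with tryCatch().
outcome <- tryCatch(
  library("notARealPackage123", character.only = TRUE),
  error = function(e) "error"
)
outcome  # "error"
```

Because require() reports failure through its return value, it is convenient inside functions; library() is the usual choice at the top of a script, where failing loudly is preferable.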
Function lapply(X, FUN, ...) applies a function to each element of a list or vector. lapply() always returns a list, so if X is a vector, convert the result back to a vector with unlist(). If the function requires arguments, pass them in as additional arguments to lapply(). Functions can be named or anonymous, so if a function is used only once, define it within lapply().
lapply(list(1,2,3), function(x) { 3 * x })
## [[1]]
## [1] 3
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 9
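Passing extra arguments through lapply() and flattening the result might look like this. The helper scale_by is hypothetical and exists only for the example.

```r
# Hypothetical helper with a second argument k.
scale_by <- function(x, k) x * k

# k = 3 is forwarded to scale_by() for every element.
res <- lapply(c(1, 2, 3), scale_by, k = 3)
class(res)        # "list"

# unlist() flattens the list back into a vector.
flat <- unlist(res)
flat              # 3 6 9
```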
Function sapply()
calls lapply()
then converts the list to a one-dimensional array (vector) or two-dimensional array (matrix). If sapply
cannot simplify because the resulting list contains vectors of varying lengths, then sapply()
returns the same result as lapply()
.
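The three possible shapes of an sapply() result can be sketched with anonymous functions over 1:3.

```r
# Each call returns one value: the result simplifies to a vector.
squares <- sapply(1:3, function(x) x^2)           # 1 4 9

# Each call returns two values: the result simplifies to a 2 x 3 matrix.
pairs <- sapply(1:3, function(x) c(x, x^2))

# Calls return vectors of different lengths: the result stays a list.
ragged <- sapply(1:3, function(x) seq_len(x))

is.matrix(pairs)  # TRUE
is.list(ragged)   # TRUE
```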
Function vapply() works like sapply(), but takes a FUN.VALUE argument that declares the type and length of each return value. vapply() is a safer alternative to sapply(), because it raises an error when a result does not match.
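A minimal sketch of vapply() with a matching and a deliberately mismatched FUN.VALUE. The words vector is made up for the example.

```r
words <- c("cat", "moose", "impala")

# nchar() returns one integer per word, which matches integer(1).
lens <- vapply(words, nchar, FUN.VALUE = integer(1), USE.NAMES = FALSE)
lens  # 3 5 6

# Declaring the wrong type makes vapply() fail instead of guessing.
outcome <- tryCatch(
  vapply(words, nchar, FUN.VALUE = character(1)),
  error = function(e) "type mismatch"
)
outcome  # "type mismatch"
```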
purrr Package
The purrr package maps a function over a vector and returns a vector. map() returns a list; the typed variants are map_dbl(), map_lgl(), map_int(), and map_chr(). The purrr functions provide shortcuts for the function argument, are more consistent than lapply() and sapply(), and handle iteration well.
library(purrr)
## Warning: package 'purrr' was built under R version 3.4.4
cyl <- split(mtcars, mtcars$cyl)
# Regress mpg ~ wt on each cylinder class
map(cyl, function(df) lm(mpg ~ wt, data = df))
## $`4`
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Coefficients:
## (Intercept) wt
## 39.571 -5.647
##
##
## $`6`
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Coefficients:
## (Intercept) wt
## 28.41 -2.78
##
##
## $`8`
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Coefficients:
## (Intercept) wt
## 23.868 -2.192
# Same thing with shortcuts
models <- map(cyl, ~ lm(mpg ~ wt, data = .))
coefs <- map(models, coef)
map(coefs, "wt")
## $`4`
## [1] -5.647025
##
## $`6`
## [1] -2.780106
##
## $`8`
## [1] -2.192438
# Or, using a single command with pipes.
mtcars %>%
split(mtcars$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(coef) %>%
map_dbl("wt")
## 4 6 8
## -5.647025 -2.780106 -2.192438
The safely()
function returns a list with two elements: result and error for each element. possibly()
returns a default value on errors. quietly()
captures all printed output, messages, and warnings instead of capturing errors.
# Pass the function itself, not a call to it.
safe_readLines <- safely(readLines)
# Call safe_readLines() on "http://example.org"
example_lines <- safe_readLines("http://example.org")
example_lines
## $result
## NULL
##
## $error
## NULL
# Call safe_readLines() on "http://asdfasdasdkfjlda"
nonsense_lines <- safe_readLines("http://asdfasdasdkfjlda")
nonsense_lines
## $result
## NULL
##
## $error
## NULL
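The three adverbs are easier to compare on a function that needs no network access. The sketch below wraps log() and sqrt(); it assumes the purrr package is installed, and the input values are arbitrary.

```r
library(purrr)

# safely(): always returns a list with $result and $error.
safe_log <- safely(log)
ok  <- safe_log(10)    # $result holds the value, $error is NULL
bad <- safe_log("a")   # $result is NULL, $error holds the condition

# possibly(): returns a default value instead of failing.
log_or_na <- possibly(log, otherwise = NA_real_)
vals <- map_dbl(list(1, "a", 100), log_or_na)  # second element is NA

# quietly(): captures printed output, messages, and warnings.
quiet_sqrt <- quietly(sqrt)
out <- quiet_sqrt(-1)
out$warnings           # the "NaNs produced" warning, captured as text
```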
n <- list(5, 10, 20)
mu <- list(1, 5, 10)
sd <- list(0.1, 1, 0.1)
# iterate over the lists
pmap(list(n, mu, sd), rnorm)
## [[1]]
## [1] 1.0380868 0.9605489 1.0786154 1.0073599 1.0234126
##
## [[2]]
## [1] 4.343431 6.307386 3.939620 3.125216 7.622740 5.457172 5.548574
## [8] 4.371869 4.627905 5.260454
##
## [[3]]
## [1] 10.053020 10.053259 10.119406 9.824395 9.995872 9.749677 9.997900
## [8] 10.128129 10.115909 10.197187 10.031033 10.080599 9.935449 10.055783
## [15] 10.083899 9.935934 9.781156 10.215975 10.060304 10.016733
funs <- list("rnorm", "runif", "rexp")
rnorm_params <- list(mean = 10)
runif_params <- list(min = 0, max = 5)
rexp_params <- list(rate = 5)
params <- list(
rnorm_params,
runif_params,
rexp_params
)
# Call invoke_map() on funs supplying params and setting n to 5
invoke_map(funs, params, n = 5)
## [[1]]
## [1] 9.657600 12.019679 10.136912 11.521788 9.658688
##
## [[2]]
## [1] 1.0613833 2.0008371 1.4973380 2.9227932 0.3804437
##
## [[3]]
## [1] 0.07188987 0.07739475 0.03476835 0.33302093 0.17282787
walk()
operates just like map()
except it’s designed for functions that don’t return anything. Use walk()
for functions with side effects like printing, plotting or saving.
#?walk2
stopifnot() is a quick way to stop a function if a condition fails. stopifnot() takes logical expressions as arguments and raises an error if any of them is not TRUE.
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)
both_na <- function(x, y) {
stopifnot(length(x) == length(y))
sum(is.na(x) & is.na(y))
}
#both_na(x, y)
Use stop()
instead of stopifnot()
to specify a more informative error message.
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)
both_na <- function(x, y) {
if (length(x) != length(y)) {
stop("x and y must have the same length", call. = FALSE)
}
sum(is.na(x) & is.na(y))
}
#both_na(x, y)
R features several functions for manipulating data structures:
* seq(from = 1, to = 2, by = .25): Generates a sequence from 1 to 2 incremented by .25.
* rep(x, times): Replicates elements of vectors and lists.
* sort(x): Sorts a vector.
* rev(x): Reverses the elements of a data structure for which reversal is defined.
* str(x): Displays the structure of any R object x.
* append(x, y): Appends vector or list y to x.
* is.*(): Checks the class of R object x.
* as.*(): Casts R object x to another class.
* unlist(x): Flattens (possibly nested) lists to produce a vector.
myseq <- seq(8, 2, by=-2)
myseq
## [1] 8 6 4 2
myrep <- rep(myseq, times =2)
myrep
## [1] 8 6 4 2 8 6 4 2
myrep <- rep(myseq, each = 2)
myrep
## [1] 8 8 6 6 4 4 2 2
linkedin <- list(16, 9, 13, 5, 2, 17, 14)
facebook <- list(17, 7, 5, 16, 8, 13, 14)
li_vec <- unlist(linkedin)
fb_vec <- unlist(facebook)
social_vec <- append(li_vec, fb_vec)
sort(social_vec, decreasing = TRUE)
## [1] 17 17 16 16 14 14 13 13 9 8 7 5 5 2
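The is.*(), as.*(), rev(), and seq() helpers are not exercised above, so here is a short sketch on a made-up character vector.

```r
x <- c("3", "1", "2")

is.character(x)            # TRUE
nums <- as.numeric(x)      # cast the strings to numbers
is.numeric(nums)           # TRUE

# sort() ascending, then rev() to reverse the order.
desc <- rev(sort(nums))    # 3 2 1

# seq() with an explicit step.
steps <- seq(from = 1, to = 2, by = 0.25)  # 1.00 1.25 1.50 1.75 2.00

# append() joins two vectors.
joined <- append(1:3, 4:5) # 1 2 3 4 5
```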
Regular expression functions include grepl(), grep(), sub(), and gsub(). grepl(pattern = "a", x = animals) returns TRUE for each element of x matching the pattern. In a regular expression, "^a" matches strings that start with a; "a$" matches strings that end with a; ".*" matches any character zero or more times; "\s" matches a whitespace character (written "\\s" in an R string); "[0-9]+" matches one or more digits.
grep(pattern = "a", x = animals) returns the vector indices of each element of x matching the pattern.
sub(pattern = "a", replacement = "o", x = animals) substitutes the first a with o.
gsub(pattern = "a", replacement = "o", x = animals) substitutes all a’s with o’s.
animals <- c("cat", "moose", "impala", "ant", "kiwi")
grepl(pattern = "a", x = animals)
## [1] TRUE FALSE TRUE TRUE FALSE
which(grepl(pattern = "a", x = animals))
## [1] 1 3 4
grep(pattern = "a", x = animals)
## [1] 1 3 4
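The anchors and the substitution functions can be sketched on the same animals vector. The room strings in the last line are invented for the digit pattern.

```r
animals <- c("cat", "moose", "impala", "ant", "kiwi")

# ^ anchors the match at the start, $ at the end.
starts_a <- grepl("^a", animals)   # TRUE only for "ant"
ends_a   <- grepl("a$", animals)   # TRUE only for "impala"

# sub() replaces the first match in each element, gsub() replaces all.
first_only <- sub("a", "o", animals)   # "cot" "moose" "impola" "ont" "kiwi"
all_a      <- gsub("a", "o", animals)  # "cot" "moose" "impolo" "ont" "kiwi"

# [0-9]+ matches one or more digits.
has_digits <- grepl("[0-9]+", c("room 101", "lobby"))  # TRUE FALSE
```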
There are two date-time classes in R: POSIXlt, a list with named components, and POSIXct, the number of seconds since 1970-01-01 00:00:00 UTC. POSIXct is more amenable to data frames, so you will encounter it much more often. Sys.Date() returns a Date equal to today. Sys.time() returns a POSIXct.
as.Date("2018-10-16")
## [1] "2018-10-16"
as.POSIXct("2018-11-28 08:34:00")
## [1] "2018-11-28 08:34:00 EST"
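A short sketch of what the two classes store. The time zone is fixed to UTC here so the numbers do not depend on the local machine; that choice is an assumption of the example.

```r
# Dates support arithmetic; the difference is measured in days.
d1 <- as.Date("2018-10-16")
d2 <- as.Date("2018-10-01")
days_apart <- as.numeric(d1 - d2)     # 15

# POSIXct is a count of seconds since the epoch.
ct <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
secs <- as.numeric(ct)                # 86400, one day in seconds

# POSIXlt exposes named components.
lt <- as.POSIXlt("2018-11-28 08:34:00", tz = "UTC")
lt$hour                               # 8
lt$min                                # 34
lt$year + 1900                        # year is stored as years since 1900
```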
The simplest file to import is RData.
url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"
download.file(url_rdata, "Programs/Data/wine_local.RData")
# loading wine_local.RData creates variable wine.
load("Programs/Data/wine_local.RData")
summary(wine)
## Alcohol Malic acid Ash Alcalinity of ash
## Min. :11.03 Min. :0.74 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.60 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.87 Median :2.360 Median :19.50
## Mean :12.99 Mean :2.34 Mean :2.366 Mean :19.52
## 3rd Qu.:13.67 3rd Qu.:3.10 3rd Qu.:2.560 3rd Qu.:21.50
## Max. :14.83 Max. :5.80 Max. :3.230 Max. :30.00
## Magnesium Total phenols Flavanoids Nonflavanoid phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.740 1st Qu.:1.200 1st Qu.:0.2700
## Median : 98.00 Median :2.350 Median :2.130 Median :0.3400
## Mean : 99.59 Mean :2.292 Mean :2.023 Mean :0.3623
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.860 3rd Qu.:0.4400
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color intensity Hue Proline
## Min. :0.410 Min. : 1.280 Min. :1.270 Min. : 278.0
## 1st Qu.:1.250 1st Qu.: 3.210 1st Qu.:1.930 1st Qu.: 500.0
## Median :1.550 Median : 4.680 Median :2.780 Median : 672.0
## Mean :1.587 Mean : 5.055 Mean :2.604 Mean : 745.1
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :3.580 Max. :13.000 Max. :4.000 Max. :1680.0
# or, equivalently,
load(url("https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"))
summary(wine)
## Alcohol Malic acid Ash Alcalinity of ash
## Min. :11.03 Min. :0.74 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.60 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.87 Median :2.360 Median :19.50
## Mean :12.99 Mean :2.34 Mean :2.366 Mean :19.52
## 3rd Qu.:13.67 3rd Qu.:3.10 3rd Qu.:2.560 3rd Qu.:21.50
## Max. :14.83 Max. :5.80 Max. :3.230 Max. :30.00
## Magnesium Total phenols Flavanoids Nonflavanoid phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.740 1st Qu.:1.200 1st Qu.:0.2700
## Median : 98.00 Median :2.350 Median :2.130 Median :0.3400
## Mean : 99.59 Mean :2.292 Mean :2.023 Mean :0.3623
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.860 3rd Qu.:0.4400
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color intensity Hue Proline
## Min. :0.410 Min. : 1.280 Min. :1.270 Min. : 278.0
## 1st Qu.:1.250 1st Qu.: 3.210 1st Qu.:1.930 1st Qu.: 500.0
## Median :1.550 Median : 4.680 Median :2.780 Median : 672.0
## Mean :1.587 Mean : 5.055 Mean :2.604 Mean : 745.1
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :3.580 Max. :13.000 Max. :4.000 Max. :1680.0
There are three common packages designed to load flat files: utils, which comes with base R, readr, and data.table.
utils
The base R utils package includes flat file reading functions. read.table() is a generic flat file loading function. The wrapper function read.csv() reads comma-separated files, and read.delim() reads tab-delimited files.
* stringsAsFactors = TRUE treats string variables as categorical.
* col.names = c() overrides, or sets, column names.
* colClasses = c() sets data types. "NULL" elements in the vector drop the variable.
# Opt 1: set working dir to file location
# setwd("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Data")
# Opt 2: define a file path relative to script file.
path <- file.path("Data", "swimming_pools.csv")
swimming_pools <- read.csv(path, stringsAsFactors = FALSE)
swimming_pools <- read.table(path,
sep = ",",
header = TRUE,
col.names = c("name", "address", "ph", "ph2", "open_hr","facilities", "disabl","park","lat","longit"),
colClasses = c("factor", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "numeric", "numeric"))
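Because swimming_pools.csv is not distributed with these notes, the following sketch writes a tiny stand-in file to a temporary location and reads it back. The file contents and column names are invented for illustration.

```r
# Write a small CSV to a temporary file (stand-in for a real data file).
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,address,lat",
             "Pool A,1 Main St,-27.5",
             "Pool B,2 High St,-27.6"), tmp)

# read.csv() with strings kept as character.
pools_chr <- read.csv(tmp, stringsAsFactors = FALSE)
class(pools_chr$name)        # "character"

# read.table() with renamed columns; "NULL" drops the address column.
pools_sub <- read.table(tmp,
                        sep = ",",
                        header = TRUE,
                        col.names = c("name", "address", "latitude"),
                        colClasses = c("factor", "NULL", "numeric"))
names(pools_sub)             # "name" "latitude"

unlink(tmp)                  # clean up the temporary file
```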
readr
readr is similar to utils, but is faster and less verbose. readr returns a "tibble" instead of a plain data frame. Functions read_csv() and read_tsv() are wrappers for read_delim(), similar to the construction in package utils.
* col_names = TRUE sets column names to the first row of data. Set col_names = FALSE for system-generated names or set col_names = c() to set the column names to a character vector.
* col_types sets data types. Use shorthand strings, where col_types = "cd_il" means "character, double, (skip), integer, logical"; the "_" drops the variable.
* col_factor() and col_integer() also set column types.
library(readr)
pools <- file.path("Programs/Data", "swimming_pools.csv")
# or, if on the web,
pools.path <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
pools <- read_csv(pools.path)
## Parsed with column specification:
## cols(
## Name = col_character(),
## Address = col_character(),
## Latitude = col_double(),
## Longitude = col_double()
## )
potatoes.path <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"
potatoes <- read_delim(potatoes.path, delim = "\t")
## Parsed with column specification:
## cols(
## area = col_integer(),
## temp = col_integer(),
## size = col_integer(),
## storage = col_integer(),
## method = col_integer(),
## texture = col_double(),
## flavor = col_double(),
## moistness = col_double()
## )
machine <- file.path("Programs/Data", "machine.txt")
properties <- c("new", "old")
machine.fragment <- read_tsv(machine, skip = 6, n_max = 5,
col_names = properties)
## Parsed with column specification:
## cols(
## new = col_double(),
## old = col_double()
## )
hotdogs <- file.path("Programs/Data", "hotdogs.txt")
hotdogs_factor <- read_tsv(hotdogs,
col_names = c("type", "calories", "sodium"),
skip = 1)
## Parsed with column specification:
## cols(
## type = col_character(),
## calories = col_double(),
## sodium = col_double()
## )
data.table
The data.table
package is optimized for large files. fread()
is faster and more convenient than read.table
.
library(data.table)
## Warning: package 'data.table' was built under R version 3.4.4
##
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
##
## transpose
pools <- file.path("Programs/Data", "swimming_pools.csv")
machine <- file.path("Programs/Data", "machine.txt")
properties <- c("new", "old")
machine.fragment <- fread(machine)
To import Excel files there are three packages to choose from: readxl, gdata, and XLConnect. gdata only handles .xls files and will be replaced when readxl is more mature. XLConnect is designed to work with Excel through R.
readxl
readxl cannot read directly from the internet. First download the file, then import the file.
In package readxl, excel_sheets() lists the available sheets and read_excel() reads the file.
* col_names = TRUE sets column names to the first row of data. Set col_names = FALSE for system-generated names or set col_names = c() to set the column names to a character vector.
* col_types = c() sets data types. "blank" elements in the vector drop the variable.
* skip skips lines. If the skipped lines include the header row, you will have to set the column names manually.
library(readxl)
## Warning: package 'readxl' was built under R version 3.4.4
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
download.file(url_xls, file.path("Programs/Data", "local_latitude.xls"))
#excel_readxl <- read_excel(file.path("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Programs/Data", "local_latitude.xls"))
mini.path <- file.path("Programs/Data", "MinitabIntroData.xlsx")
excel_sheets(mini.path)
## [1] "Sheet1" "Sheet2"
sheet1 <- read_excel(mini.path, sheet = "Sheet1")
sheet2 <- read_excel(mini.path, sheet = "Sheet2")
sheet.list = list(sheet1, sheet2)
# Equivalently...
sheet.list <- lapply(excel_sheets(mini.path),
read_excel, path = mini.path)
gdata
gdata
requires perl in the background. It can only read .xls
files. It can read directly from web sites though.
library(gdata)
## Warning: package 'gdata' was built under R version 3.4.4
## gdata: Unable to locate valid perl interpreter
## gdata:
## gdata: read.xls() will be unable to read Excel XLS and XLSX files
## gdata: unless the 'perl=' argument is used to specify the location
## gdata: of a valid perl intrpreter.
## gdata:
## gdata: (To avoid display of this message in the future, please
## gdata: ensure perl is installed and available on the executable
## gdata: search path.)
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLX' (Excel 97-2004) files.
##
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLSX' (Excel 2007+) files.
##
## gdata: Run the function 'installXLSXsupport()'
## gdata: to automatically download and install the perl
## gdata: libaries needed to support Excel XLS and XLSX formats.
##
## Attaching package: 'gdata'
## The following objects are masked from 'package:data.table':
##
## first, last
## The following object is masked from 'package:purrr':
##
## keep
## The following object is masked from 'package:stats':
##
## nobs
## The following object is masked from 'package:utils':
##
## object.size
## The following object is masked from 'package:base':
##
## startsWith
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
#read.xls(url_xls)
#library(XLConnect)
mini.path <- file.path("Programs/Data", "MinitabIntroData.xlsx")
#my_book <- loadWorkbook(mini.path)
#class(my_book)
#getSheets(my_book)
#readWorksheet(my_book, sheet = 2)
#all <- lapply(sheets, readWorksheet, object = my_book)
#str(all)
#createSheet(my_book, name = "year_2010")
#writeWorksheet(my_book, pop_2010, sheet = "year_2010")
#saveWorkbook(my_book, file = "MinitabIntroData2.xlsx")
There is a dedicated package for each DBMS: RMySQL, RPostgreSQL, ROracle, etc. Function dbGetQuery() is a convenient wrapper around three functions: dbSendQuery(), dbFetch(), and dbClearResult(). Use the three functions separately if the data set is large and only a chunk of data is needed at a time.
library(DBI)
## Warning: package 'DBI' was built under R version 3.4.4
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
con
## <MySQLConnection:0,0>
# read all tables into a list of data frames
table_names <- dbListTables(con)
tables <- lapply(table_names, dbReadTable, conn = con)
# read an entire table, then subset the rows you want (inefficient)
comments <- dbReadTable(con, "comments")
# Keep the rows where user_id is 1.
subset(comments,
subset = user_id == 1)
## id tweat_id user_id message
## 4 1012 87 1 awesome! thanks!
## 7 1004 49 1 this is fabulous!
## 11 1020 77 1 couldn't be better
## 12 1014 77 1 saved my day
elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments
WHERE user_id = 1")
latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > \"2015-09-21\"")
dbDisconnect(con)
## [1] TRUE
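The three-step alternative to dbGetQuery() can be sketched as follows. This is a minimal illustration, not the course's code: it assumes the RSQLite package is installed and uses an in-memory SQLite database with a made-up comments table in place of the remote MySQL server.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for the MySQL server.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "comments",
             data.frame(id = 1:5, user_id = c(1, 2, 1, 3, 1)))

# Send the query without fetching any rows yet.
res <- dbSendQuery(con, "SELECT id FROM comments WHERE user_id = 1 ORDER BY id")

# Fetch two rows at a time until the result set is exhausted.
chunks <- list()
while (!dbHasCompleted(res)) {
  chunks[[length(chunks) + 1]] <- dbFetch(res, n = 2)
}

# Release the result set before disconnecting.
dbClearResult(res)
dbDisconnect(con)

all_ids <- do.call(rbind, chunks)$id
all_ids
```

In real use each chunk would be processed and discarded inside the loop rather than accumulated, which is what keeps memory use low.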
If a file resides on the web, reference its URL directly instead of downloading it manually. Some Excel packages (for example, readxl) cannot read from a URL, so for those you first have to download the file.
url = "http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r"
dest_path = file.path("~", "local_cities.xlsx")
#download.file(url, dest_path)
The httr package also handles internet files: GET() returns a response object and content() extracts its body.
library(httr)
## Warning: package 'httr' was built under R version 3.4.4
resp <- GET("http://www.example.com/")
raw_content <- content(resp, as = "raw")
head(raw_content)
## [1] 3c 21 64 6f 63 74
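The raw bytes are easier to read once converted to text. The sketch below hard-codes the six bytes printed above (so it runs without a network connection) and decodes them with base R's rawToChar(); with a live response, content(resp, as = "text") performs the decoding for you.

```r
# The six bytes shown by head(raw_content) in the output above.
first_bytes <- as.raw(c(0x3c, 0x21, 0x64, 0x6f, 0x63, 0x74))

# Decode the bytes into a character string.
first_chars <- rawToChar(first_bytes)
first_chars
```

The result is the start of an HTML doctype declaration, which confirms that the response body is an HTML page.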
JSON files are either name-value pair objects {"id":1,"name":"Frank"}, or arrays [1,2,3,"dog"].
library(jsonlite)
## Warning: package 'jsonlite' was built under R version 3.4.4
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'
# Convert a JSON string into a list
wine <- fromJSON(wine_json)
str(wine)
## List of 5
## $ name : chr "Chateau Migraine"
## $ year : int 1997
## $ alcohol_pct: num 12.4
## $ color : chr "red"
## $ awarded : logi FALSE
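A JSON object becomes a named list, as above. A JSON array is simplified differently, which the following sketch illustrates with made-up records: an array of objects that share the same names becomes a data frame, and an array of numbers becomes a vector.

```r
library(jsonlite)

# An array of objects with the same fields (hypothetical example data).
people_json <- '[{"id":1,"name":"Frank"},{"id":2,"name":"Julie"}]'
people <- fromJSON(people_json)
class(people)

# An array of plain numbers is simplified to an atomic vector.
numbers <- fromJSON('[1, 2, 3]')
numbers
```

Pass simplifyVector = FALSE to fromJSON() if you prefer nested lists in both cases.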
# Convert web API JSON into list
url_sw4 <- "http://www.omdbapi.com/?apikey=ff21610b&i=tt0076759&r=json"
url_sw3 <- "http://www.omdbapi.com/?apikey=ff21610b&i=tt0121766&r=json"
# Import two URLs with fromJSON(): sw4 and sw3
#sw4 <- fromJSON(url_sw4)
#sw3 <- fromJSON(url_sw3)
# Print the Title element of both lists
#sw4$Title
#sw3$Title
# Convert mtcars to a pretty JSON: pretty_json
pretty_json <- toJSON(mtcars, pretty = TRUE)
pretty_json
## [
## {
## "mpg": 21,
## "cyl": 6,
## "disp": 160,
## "hp": 110,
## "drat": 3.9,
## "wt": 2.62,
## "qsec": 16.46,
## "vs": 0,
## "am": 1,
## "gear": 4,
## "carb": 4,
## "_row": "Mazda RX4"
## },
## {
## "mpg": 21,
## "cyl": 6,
## "disp": 160,
## "hp": 110,
## "drat": 3.9,
## "wt": 2.875,
## "qsec": 17.02,
## "vs": 0,
## "am": 1,
## "gear": 4,
## "carb": 4,
## "_row": "Mazda RX4 Wag"
## },
## {
## "mpg": 22.8,
## "cyl": 4,
## "disp": 108,
## "hp": 93,
## "drat": 3.85,
## "wt": 2.32,
## "qsec": 18.61,
## "vs": 1,
## "am": 1,
## "gear": 4,
## "carb": 1,
## "_row": "Datsun 710"
## },
## {
## "mpg": 21.4,
## "cyl": 6,
## "disp": 258,
## "hp": 110,
## "drat": 3.08,
## "wt": 3.215,
## "qsec": 19.44,
## "vs": 1,
## "am": 0,
## "gear": 3,
## "carb": 1,
## "_row": "Hornet 4 Drive"
## },
## {
## "mpg": 18.7,
## "cyl": 8,
## "disp": 360,
## "hp": 175,
## "drat": 3.15,
## "wt": 3.44,
## "qsec": 17.02,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 2,
## "_row": "Hornet Sportabout"
## },
## {
## "mpg": 18.1,
## "cyl": 6,
## "disp": 225,
## "hp": 105,
## "drat": 2.76,
## "wt": 3.46,
## "qsec": 20.22,
## "vs": 1,
## "am": 0,
## "gear": 3,
## "carb": 1,
## "_row": "Valiant"
## },
## {
## "mpg": 14.3,
## "cyl": 8,
## "disp": 360,
## "hp": 245,
## "drat": 3.21,
## "wt": 3.57,
## "qsec": 15.84,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 4,
## "_row": "Duster 360"
## },
## {
## "mpg": 24.4,
## "cyl": 4,
## "disp": 146.7,
## "hp": 62,
## "drat": 3.69,
## "wt": 3.19,
## "qsec": 20,
## "vs": 1,
## "am": 0,
## "gear": 4,
## "carb": 2,
## "_row": "Merc 240D"
## },
## {
## "mpg": 22.8,
## "cyl": 4,
## "disp": 140.8,
## "hp": 95,
## "drat": 3.92,
## "wt": 3.15,
## "qsec": 22.9,
## "vs": 1,
## "am": 0,
## "gear": 4,
## "carb": 2,
## "_row": "Merc 230"
## },
## {
## "mpg": 19.2,
## "cyl": 6,
## "disp": 167.6,
## "hp": 123,
## "drat": 3.92,
## "wt": 3.44,
## "qsec": 18.3,
## "vs": 1,
## "am": 0,
## "gear": 4,
## "carb": 4,
## "_row": "Merc 280"
## },
## {
## "mpg": 17.8,
## "cyl": 6,
## "disp": 167.6,
## "hp": 123,
## "drat": 3.92,
## "wt": 3.44,
## "qsec": 18.9,
## "vs": 1,
## "am": 0,
## "gear": 4,
## "carb": 4,
## "_row": "Merc 280C"
## },
## {
## "mpg": 16.4,
## "cyl": 8,
## "disp": 275.8,
## "hp": 180,
## "drat": 3.07,
## "wt": 4.07,
## "qsec": 17.4,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 3,
## "_row": "Merc 450SE"
## },
## {
## "mpg": 17.3,
## "cyl": 8,
## "disp": 275.8,
## "hp": 180,
## "drat": 3.07,
## "wt": 3.73,
## "qsec": 17.6,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 3,
## "_row": "Merc 450SL"
## },
## {
## "mpg": 15.2,
## "cyl": 8,
## "disp": 275.8,
## "hp": 180,
## "drat": 3.07,
## "wt": 3.78,
## "qsec": 18,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 3,
## "_row": "Merc 450SLC"
## },
## {
## "mpg": 10.4,
## "cyl": 8,
## "disp": 472,
## "hp": 205,
## "drat": 2.93,
## "wt": 5.25,
## "qsec": 17.98,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 4,
## "_row": "Cadillac Fleetwood"
## },
## {
## "mpg": 10.4,
## "cyl": 8,
## "disp": 460,
## "hp": 215,
## "drat": 3,
## "wt": 5.424,
## "qsec": 17.82,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 4,
## "_row": "Lincoln Continental"
## },
## {
## "mpg": 14.7,
## "cyl": 8,
## "disp": 440,
## "hp": 230,
## "drat": 3.23,
## "wt": 5.345,
## "qsec": 17.42,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 4,
## "_row": "Chrysler Imperial"
## },
## {
## "mpg": 32.4,
## "cyl": 4,
## "disp": 78.7,
## "hp": 66,
## "drat": 4.08,
## "wt": 2.2,
## "qsec": 19.47,
## "vs": 1,
## "am": 1,
## "gear": 4,
## "carb": 1,
## "_row": "Fiat 128"
## },
## {
## "mpg": 30.4,
## "cyl": 4,
## "disp": 75.7,
## "hp": 52,
## "drat": 4.93,
## "wt": 1.615,
## "qsec": 18.52,
## "vs": 1,
## "am": 1,
## "gear": 4,
## "carb": 2,
## "_row": "Honda Civic"
## },
## {
## "mpg": 33.9,
## "cyl": 4,
## "disp": 71.1,
## "hp": 65,
## "drat": 4.22,
## "wt": 1.835,
## "qsec": 19.9,
## "vs": 1,
## "am": 1,
## "gear": 4,
## "carb": 1,
## "_row": "Toyota Corolla"
## },
## {
## "mpg": 21.5,
## "cyl": 4,
## "disp": 120.1,
## "hp": 97,
## "drat": 3.7,
## "wt": 2.465,
## "qsec": 20.01,
## "vs": 1,
## "am": 0,
## "gear": 3,
## "carb": 1,
## "_row": "Toyota Corona"
## },
## {
## "mpg": 15.5,
## "cyl": 8,
## "disp": 318,
## "hp": 150,
## "drat": 2.76,
## "wt": 3.52,
## "qsec": 16.87,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 2,
## "_row": "Dodge Challenger"
## },
## {
## "mpg": 15.2,
## "cyl": 8,
## "disp": 304,
## "hp": 150,
## "drat": 3.15,
## "wt": 3.435,
## "qsec": 17.3,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 2,
## "_row": "AMC Javelin"
## },
## {
## "mpg": 13.3,
## "cyl": 8,
## "disp": 350,
## "hp": 245,
## "drat": 3.73,
## "wt": 3.84,
## "qsec": 15.41,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 4,
## "_row": "Camaro Z28"
## },
## {
## "mpg": 19.2,
## "cyl": 8,
## "disp": 400,
## "hp": 175,
## "drat": 3.08,
## "wt": 3.845,
## "qsec": 17.05,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 2,
## "_row": "Pontiac Firebird"
## },
## {
## "mpg": 27.3,
## "cyl": 4,
## "disp": 79,
## "hp": 66,
## "drat": 4.08,
## "wt": 1.935,
## "qsec": 18.9,
## "vs": 1,
## "am": 1,
## "gear": 4,
## "carb": 1,
## "_row": "Fiat X1-9"
## },
## {
## "mpg": 26,
## "cyl": 4,
## "disp": 120.3,
## "hp": 91,
## "drat": 4.43,
## "wt": 2.14,
## "qsec": 16.7,
## "vs": 0,
## "am": 1,
## "gear": 5,
## "carb": 2,
## "_row": "Porsche 914-2"
## },
## {
## "mpg": 30.4,
## "cyl": 4,
## "disp": 95.1,
## "hp": 113,
## "drat": 3.77,
## "wt": 1.513,
## "qsec": 16.9,
## "vs": 1,
## "am": 1,
## "gear": 5,
## "carb": 2,
## "_row": "Lotus Europa"
## },
## {
## "mpg": 15.8,
## "cyl": 8,
## "disp": 351,
## "hp": 264,
## "drat": 4.22,
## "wt": 3.17,
## "qsec": 14.5,
## "vs": 0,
## "am": 1,
## "gear": 5,
## "carb": 4,
## "_row": "Ford Pantera L"
## },
## {
## "mpg": 19.7,
## "cyl": 6,
## "disp": 145,
## "hp": 175,
## "drat": 3.62,
## "wt": 2.77,
## "qsec": 15.5,
## "vs": 0,
## "am": 1,
## "gear": 5,
## "carb": 6,
## "_row": "Ferrari Dino"
## },
## {
## "mpg": 15,
## "cyl": 8,
## "disp": 301,
## "hp": 335,
## "drat": 3.54,
## "wt": 3.57,
## "qsec": 14.6,
## "vs": 0,
## "am": 1,
## "gear": 5,
## "carb": 8,
## "_row": "Maserati Bora"
## },
## {
## "mpg": 21.4,
## "cyl": 4,
## "disp": 121,
## "hp": 109,
## "drat": 4.11,
## "wt": 2.78,
## "qsec": 18.6,
## "vs": 1,
## "am": 1,
## "gear": 4,
## "carb": 2,
## "_row": "Volvo 142E"
## }
## ]
# Minify pretty_json: mini_json
mini_json <- minify(pretty_json)
mini_json
## [{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"},{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.875,"qsec":17.02,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4 Wag"},{"mpg":22.8,"cyl":4,"disp":108,"hp":93,"drat":3.85,"wt":2.32,"qsec":18.61,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Datsun 710"},{"mpg":21.4,"cyl":6,"disp":258,"hp":110,"drat":3.08,"wt":3.215,"qsec":19.44,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Hornet 4 Drive"},{"mpg":18.7,"cyl":8,"disp":360,"hp":175,"drat":3.15,"wt":3.44,"qsec":17.02,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Hornet Sportabout"},{"mpg":18.1,"cyl":6,"disp":225,"hp":105,"drat":2.76,"wt":3.46,"qsec":20.22,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Valiant"},{"mpg":14.3,"cyl":8,"disp":360,"hp":245,"drat":3.21,"wt":3.57,"qsec":15.84,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Duster 360"},{"mpg":24.4,"cyl":4,"disp":146.7,"hp":62,"drat":3.69,"wt":3.19,"qsec":20,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 240D"},{"mpg":22.8,"cyl":4,"disp":140.8,"hp":95,"drat":3.92,"wt":3.15,"qsec":22.9,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 230"},{"mpg":19.2,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.3,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280"},{"mpg":17.8,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.9,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280C"},{"mpg":16.4,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":4.07,"qsec":17.4,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SE"},{"mpg":17.3,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.73,"qsec":17.6,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SL"},{"mpg":15.2,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.78,"qsec":18,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SLC"},{"mpg":10.4,"cyl":8,"disp":472,"hp":205,"drat":2.93,"wt":5.25,"qsec":17.98,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Cadillac 
Fleetwood"},{"mpg":10.4,"cyl":8,"disp":460,"hp":215,"drat":3,"wt":5.424,"qsec":17.82,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Lincoln Continental"},{"mpg":14.7,"cyl":8,"disp":440,"hp":230,"drat":3.23,"wt":5.345,"qsec":17.42,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Chrysler Imperial"},{"mpg":32.4,"cyl":4,"disp":78.7,"hp":66,"drat":4.08,"wt":2.2,"qsec":19.47,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat 128"},{"mpg":30.4,"cyl":4,"disp":75.7,"hp":52,"drat":4.93,"wt":1.615,"qsec":18.52,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Honda Civic"},{"mpg":33.9,"cyl":4,"disp":71.1,"hp":65,"drat":4.22,"wt":1.835,"qsec":19.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Toyota Corolla"},{"mpg":21.5,"cyl":4,"disp":120.1,"hp":97,"drat":3.7,"wt":2.465,"qsec":20.01,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Toyota Corona"},{"mpg":15.5,"cyl":8,"disp":318,"hp":150,"drat":2.76,"wt":3.52,"qsec":16.87,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Dodge Challenger"},{"mpg":15.2,"cyl":8,"disp":304,"hp":150,"drat":3.15,"wt":3.435,"qsec":17.3,"vs":0,"am":0,"gear":3,"carb":2,"_row":"AMC Javelin"},{"mpg":13.3,"cyl":8,"disp":350,"hp":245,"drat":3.73,"wt":3.84,"qsec":15.41,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Camaro Z28"},{"mpg":19.2,"cyl":8,"disp":400,"hp":175,"drat":3.08,"wt":3.845,"qsec":17.05,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Pontiac Firebird"},{"mpg":27.3,"cyl":4,"disp":79,"hp":66,"drat":4.08,"wt":1.935,"qsec":18.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat X1-9"},{"mpg":26,"cyl":4,"disp":120.3,"hp":91,"drat":4.43,"wt":2.14,"qsec":16.7,"vs":0,"am":1,"gear":5,"carb":2,"_row":"Porsche 914-2"},{"mpg":30.4,"cyl":4,"disp":95.1,"hp":113,"drat":3.77,"wt":1.513,"qsec":16.9,"vs":1,"am":1,"gear":5,"carb":2,"_row":"Lotus Europa"},{"mpg":15.8,"cyl":8,"disp":351,"hp":264,"drat":4.22,"wt":3.17,"qsec":14.5,"vs":0,"am":1,"gear":5,"carb":4,"_row":"Ford Pantera L"},{"mpg":19.7,"cyl":6,"disp":145,"hp":175,"drat":3.62,"wt":2.77,"qsec":15.5,"vs":0,"am":1,"gear":5,"carb":6,"_row":"Ferrari 
Dino"},{"mpg":15,"cyl":8,"disp":301,"hp":335,"drat":3.54,"wt":3.57,"qsec":14.6,"vs":0,"am":1,"gear":5,"carb":8,"_row":"Maserati Bora"},{"mpg":21.4,"cyl":4,"disp":121,"hp":109,"drat":4.11,"wt":2.78,"qsec":18.6,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Volvo 142E"}]
haven and foreign
Packages haven and foreign import SAS, Stata, and SPSS data files into R.
library(haven)
## Warning: package 'haven' was built under R version 3.4.4
sales <- read_sas("http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/sales.sas7bdat")
sugar <- read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")
# Convert labeled values in Date column to dates
sugar$Date <- as.Date(as_factor(sugar$Date))
dat <- read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")
library(foreign)
# foreign can load xport (.xpt) files but not sas7bdat files.
# load in the data and store it in the variable cars
cars <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars.csv")
# print the first 6 rows of the dataset using the head() function
head(cars)
## mpg cyl disp hp drat wt qsec vs am gear carb car
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Valiant
Change the field separator for text files with the sep argument. Use sep = "\t" for tab.
# load in the dataset
cars <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars_semicolon.csv", sep = ";")
# print the first 6 rows of the dataset
head(cars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
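For tab-separated data the separator is the escape sequence "\t". The sketch below uses a small inline string (made-up values, passed through the text argument) so that it runs without a file; read.delim() is the base R shortcut with tab as its default separator.

```r
# Two columns separated by tabs, supplied inline instead of from a file.
tsv_text <- "mpg\tcyl\n21.0\t6\n22.8\t4"

# Explicit separator.
cars_tab <- read.csv(text = tsv_text, sep = "\t")

# Same result with the tab-delimited convenience function.
cars_tab2 <- read.delim(text = tsv_text)

cars_tab
```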
Get and set your working directory with getwd() and setwd().
getwd()
## [1] "C:/Users/mpfol/OneDrive/Documents/Data Analysis"
list.files()
## [1] "Analyzing Survey Data in R.Rmd"
## [2] "Analyzing_Survey_Data_in_R.html"
## [3] "Cookbook for R.Rmd"
## [4] "Cookbook_for_R.html"
## [5] "Cookbook_for_R.Rmd"
## [6] "Cookbook_for_R_files"
## [7] "Coursework"
## [8] "Data"
## [9] "Data Analysis.docx"
## [10] "Data Analysis.xlsx"
## [11] "Data Visualization.docx"
## [12] "Foundations of Inference.Rmd"
## [13] "Foundations_of_Inference.html"
## [14] "local_latitude.xls"
## [15] "Programs"
## [16] "rmarkdown-cheatsheet.pdf"
## [17] "rsconnect"
## [18] "Statistical Analysis.docx"
## [19] "Statistical Package Syntax (1).docx"
## [20] "Statistics Notes.docx"
## [21] "Statistics v20170301.docx"
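Setting the working directory is done with setwd(). A minimal sketch, assuming the session's temporary directory as the target (any existing folder would do), that also restores the original directory afterwards:

```r
old_dir <- getwd()        # remember the current directory
setwd(tempdir())          # switch to a scratch directory

# Confirm that the switch worked (normalizePath() resolves symlinks).
in_temp <- normalizePath(getwd()) == normalizePath(tempdir())

setwd(old_dir)            # restore the original directory
in_temp
```

Restoring the directory matters in scripts, because relative paths later in the script would otherwise resolve against the new location.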
Data exploration starts with evaluating structure and characteristics using class() (it should be a data.frame), dim(), and names(). Create summaries with str() or glimpse(), and summary(). Run some initial visualizations for insight into distributions. Use histograms for univariate analysis, scatterplots for numeric-numeric bivariate analysis, and boxplots for numeric-factor bivariate analysis.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:gdata':
##
## combine, first, last
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Check structure
class(mtcars)
## [1] "data.frame"
dim(mtcars)
## [1] 32 11
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
# Initial summaries
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
glimpse(mtcars) # Slightly cleaner version of str (requires dplyr).
## Observations: 32
## Variables: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
hist(mtcars$mpg)
plot(mtcars$mpg, mtcars$qsec)
# View sample data
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
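The numeric-factor case can be sketched with the same data set. Treating cyl as a factor is an assumption made here for illustration; the formula interface draws one box of mpg per cylinder count, and tapply() gives the matching group medians.

```r
# One box of mpg for each number of cylinders.
boxplot(mpg ~ factor(cyl), data = mtcars, xlab = "cyl", ylab = "mpg")

# The medians drawn as the centre line of each box.
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, median)
mpg_by_cyl
```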
In tidy data each table holds a single observational unit, each row is an observation, and each column is a variable. Use the tidyr package to tidy messy data.
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
wide_df <- data.frame(Obs=c(1,2),
a=c(1,4),
b=c(2,5),
c=c(3,6),
year_mo=c("2010-05","2007-07"))
wide_df
## Obs a b c year_mo
## 1 1 1 2 3 2010-05
## 2 2 4 5 6 2007-07
# Gather wide data into key-value pairs. Exclude Obs and year_mo
long_df <- gather(wide_df, my_key, my_val, -c(Obs,year_mo))
long_df
## Obs year_mo my_key my_val
## 1 1 2010-05 a 1
## 2 2 2007-07 a 4
## 3 1 2010-05 b 2
## 4 2 2007-07 b 5
## 5 1 2010-05 c 3
## 6 2 2007-07 c 6
# The opposite of gather() is spread()
wide_df <- spread(long_df, my_key, my_val)
wide_df
## Obs year_mo a b c
## 1 1 2010-05 1 2 3
## 2 2 2007-07 4 5 6
# Split a column using separate().
long_df_sep <- separate(long_df, col = year_mo, into = c("year","month"), sep = "-")
long_df_sep
## Obs year month my_key my_val
## 1 1 2010 05 a 1
## 2 2 2007 07 a 4
## 3 1 2010 05 b 2
## 4 2 2007 07 b 5
## 5 1 2010 05 c 3
## 6 2 2007 07 c 6
# The opposite of separate() is unite()
long_df_uni <- unite(long_df_sep, year_mo, year, month, sep = "-")
long_df_uni
## Obs year_mo my_key my_val
## 1 1 2010-05 a 1
## 2 2 2007-07 a 4
## 3 1 2010-05 b 2
## 4 2 2007-07 b 5
## 5 1 2010-05 c 3
## 6 2 2007-07 c 6
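Newer tidyr releases (1.0.0 and later) offer pivot_longer() and pivot_wider() as successors to gather() and spread(). The sketch below repeats the reshaping above with them; it assumes such a tidyr version is installed. Note that pivot_longer() orders its output row by row, so the rows come out in a different order than with gather().

```r
library(tidyr)

wide_df <- data.frame(Obs = c(1, 2),
                      a = c(1, 4),
                      b = c(2, 5),
                      c = c(3, 6),
                      year_mo = c("2010-05", "2007-07"))

# Wide to long: columns a, b, c become key-value pairs.
long_df <- pivot_longer(wide_df, cols = c(a, b, c),
                        names_to = "my_key", values_to = "my_val")

# Long back to wide.
wide_again <- pivot_wider(long_df, names_from = my_key, values_from = my_val)
wide_again
```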
Types of variables in R:
* character
* numeric, including NaN and Inf
* integer, denoted 123L
* factor
* logical, including NA
Coerce variables into data types with:
* as.character()
* as.numeric()
* as.integer()
* as.factor()
* as.logical(), where 0 becomes FALSE and any other number becomes TRUE
* Package lubridate coerces strings to dates. Valid masking characters are y, m, d, h, m, and s.

Unite several fields into one with unite(). Rearrange column order with select(). Change the structure of multiple columns with mutate_at().
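A few coercions in base R, as a short sketch with made-up values. The factor case deserves attention: as.numeric() on a factor returns the internal level codes, so convert to character first when the labels themselves are numbers.

```r
int_val <- as.integer("12")          # character to integer
num_val <- as.numeric("3.14")        # character to double
lgl_val <- as.logical(c(0, 1, 2))    # 0 is FALSE, other numbers are TRUE
chr_val <- as.character(123L)        # integer to character

# Factor pitfall: level codes versus labels.
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                 # 1 2 1 (level codes)
labels <- as.numeric(as.character(f))   # 10 20 10 (the values you wanted)
```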
Because the period (.) has special meaning in certain situations, use underscores (_) to separate words in variable names. Use all lowercase letters so that no one has to remember which letters are uppercase or lowercase.
Package lubridate manipulates dates. Round dates with round_date(), floor_date(), and ceiling_date(). All three take a unit argument specifying the resolution of rounding: "second", "minute", "hour", "day", "week", "month", "bimonth", "quarter", "halfyear", or "year". You can also specify any multiple of those units, e.g. "5 years" or "3 minutes".
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.4.4
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following object is masked from 'package:base':
##
## date
# There are 3! = 6 date-parsing functions: ymd(), ydm(), mdy(), myd(), dmy(), dym().
# Parse datetimes by appending _h, _hm, or _hms, e.g. ymd_hms().
as.Date(ymd_hms("2005/10/23 14:40:00"))
## [1] "2005-10-23"
as.POSIXct(mdy("July 21, 2006"))
## [1] "2006-07-20 20:00:00 EDT"
ymd("2006-07-21")
## [1] "2006-07-21"
ymd("2006 Jul 21")
## [1] "2006-07-21"
mdy("July 21, 2006")
## [1] "2006-07-21"
hms("10:25:09")
## [1] "10H 25M 9S"
ymd_hms("2005/10/23 14:40:00")
## [1] "2005-10-23 14:40:00 UTC"
# If the date is in an unsupported order like dym_msh, use parse_date_time() with the
# orders argument specifying the order of the components in the date.
# Combine date parts with make_date(year, month, day).
r_3_4_1 <- ymd_hms("2016-05-03 07:13:28 UTC")
# Date rounding
floor_date(r_3_4_1, unit = "day")
## [1] "2016-05-03 UTC"
round_date(r_3_4_1, unit = "5 minutes")
## [1] "2016-05-03 07:15:00 UTC"
ceiling_date(r_3_4_1, unit = "week")
## [1] "2016-05-08 UTC"
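The two helpers mentioned in the comments above can be sketched like this. The input string is made up for illustration; its time comes before the date, an order that none of the ymd_hms()-style functions covers.

```r
library(lubridate)

# Time first, then day-month-year: spell out the component order.
parsed <- parse_date_time("10:25:09 21-07-2006", orders = "HMS dmy")
parsed

# Build a date from separate numeric parts.
built <- make_date(year = 2006, month = 7, day = 21)
built
```

parse_date_time() returns a POSIXct datetime in UTC by default, while make_date() returns a Date.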
Subtract dates with the simple - operator to get a difference in days, or get finer control with the base function difftime(t1, t2, units). The current system datetime and date are available from now() and today().
date_landing <- mdy("July 20, 1969")
moment_step <- mdy_hms("July 20, 1969, 02:56:15", tz = "UTC")
difftime(today(), date_landing, units = "days")
## Time difference of 18075 days
difftime(now(), moment_step, units = "secs")
## Time difference of 1561709101 secs
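Because today() and now() change with every run, a fixed pair of dates makes the behaviour easier to check. This sketch uses two dates chosen for illustration and shows both the - operator and difftime() with explicit units.

```r
library(lubridate)

launch     <- ymd("1969-07-16")
splashdown <- ymd("1969-07-24")

# The minus operator returns a difftime in days.
days_diff <- splashdown - launch
days_diff

# difftime() lets you choose the unit.
hours_diff <- difftime(splashdown, launch, units = "hours")
hours_diff
```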
Use timespans to add a fixed amount of time to dates. Distinguish periods (clock and calendar units, as humans understand them) from durations (exact numbers of seconds) to handle daylight saving time gracefully. By combining addition and multiplication with sequences you can generate sequences of datetimes.
library(lubridate)
# Add a period of one week to mon_2pm
mon_2pm <- dmy_hm("27 Aug 2018 14:00")
mon_2pm + weeks(1)
## [1] "2018-09-03 14:00:00 UTC"
# Add a duration of 81 hours to tue_9am
tue_9am <- dmy_hm("28 Aug 2018 9:00")
tue_9am + dhours(81)
## [1] "2018-08-31 18:00:00 UTC"
# A period of five calendar years need not equal a duration of five years (a fixed number of seconds).
today() - years(5)
## [1] "2014-01-14"
today() - dyears(5)
## [1] "2014-01-15"
# Create combined periods and durations.
eclipse_2017 <- ymd_hms("2017-08-21 18:26:40")
synodic <- ddays(29) + dhours(12) + dminutes(44) + dseconds(3)
# Create datetime for every two weeks for a year
today_8am <- today() + hours(8)
every_two_weeks <- 1:26 * weeks(2)
today_8am + every_two_weeks
## [1] "2019-01-28 08:00:00 UTC" "2019-02-11 08:00:00 UTC"
## [3] "2019-02-25 08:00:00 UTC" "2019-03-11 08:00:00 UTC"
## [5] "2019-03-25 08:00:00 UTC" "2019-04-08 08:00:00 UTC"
## [7] "2019-04-22 08:00:00 UTC" "2019-05-06 08:00:00 UTC"
## [9] "2019-05-20 08:00:00 UTC" "2019-06-03 08:00:00 UTC"
## [11] "2019-06-17 08:00:00 UTC" "2019-07-01 08:00:00 UTC"
## [13] "2019-07-15 08:00:00 UTC" "2019-07-29 08:00:00 UTC"
## [15] "2019-08-12 08:00:00 UTC" "2019-08-26 08:00:00 UTC"
## [17] "2019-09-09 08:00:00 UTC" "2019-09-23 08:00:00 UTC"
## [19] "2019-10-07 08:00:00 UTC" "2019-10-21 08:00:00 UTC"
## [21] "2019-11-04 08:00:00 UTC" "2019-11-18 08:00:00 UTC"
## [23] "2019-12-02 08:00:00 UTC" "2019-12-16 08:00:00 UTC"
## [25] "2019-12-30 08:00:00 UTC" "2020-01-13 08:00:00 UTC"
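The difference between periods and durations shows up most clearly across a daylight saving change. The sketch below picks the US spring-forward date in 2018 as an illustration: adding a period of one day keeps the clock time, while adding a duration of one day adds exactly 86,400 seconds.

```r
library(lubridate)

# Noon on the day before clocks moved forward in New York.
before_dst <- ymd_hms("2018-03-10 12:00:00", tz = "America/New_York")

plus_period   <- before_dst + days(1)    # same clock time the next day
plus_duration <- before_dst + ddays(1)   # exactly 24 hours later

format(plus_period, "%Y-%m-%d %H:%M")
format(plus_duration, "%Y-%m-%d %H:%M")
```

The period lands on 12:00 and the duration on 13:00, because that calendar day was only 23 hours long.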
ymd("2018-01-31") + months(1) returns NA because February 31 does not exist. For situations like this, use alternative operators like %m+%, which roll back to the last valid day of the month.
library(lubridate)
# A sequence of 1 to 12 periods of 1 month
month_seq <- 1:12 * months(1)
# Add 1 to 12 months to jan_31. This way returns NAs.
ymd("2018-01-31") + month_seq
## [1] NA "2018-03-31" NA "2018-05-31" NA
## [6] "2018-07-31" "2018-08-31" NA "2018-10-31" NA
## [11] "2018-12-31" "2019-01-31"
# Better way.
ymd("2018-01-31") %m+% month_seq
## [1] "2018-02-28" "2018-03-31" "2018-04-30" "2018-05-31" "2018-06-30"
## [6] "2018-07-31" "2018-08-31" "2018-09-30" "2018-10-31" "2018-11-30"
## [11] "2018-12-31" "2019-01-31"
Intervals have a specific start and end time. There are two notations: datetime1 %--% datetime2
, or interval(datetime1, datetime2)
.
# Two ways to create an interval.
dmy("5 January 1961") %--% dmy("30 January 1969")
## [1] 1961-01-05 UTC--1969-01-30 UTC
interval(dmy("5 January 1961"), dmy("30 January 1969"))
## [1] 1961-01-05 UTC--1969-01-30 UTC
Once you have an interval you can find out its start, end, and length with int_start(), int_end(), and int_length() respectively. You can test whether a date is %within% an interval. You can test whether two intervals overlap with int_overlaps().
my_intvl <- interval(dmy("5 January 1961"), dmy("30 January 1969"))
int_length(my_intvl)
## [1] 254620800
y2001 <- ymd("2001-01-01") %--% ymd("2001-12-31")
ymd("2001-03-30") %within% y2001
## [1] TRUE
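The remaining interval helpers can be sketched on two made-up intervals. int_length() reports seconds, so the 364 days between 1 January and 31 December 2001 come out as 364 * 86400.

```r
library(lubridate)

y2001  <- interval(ymd("2001-01-01"), ymd("2001-12-31"))
mid_01 <- interval(ymd("2001-06-01"), ymd("2002-06-01"))

start_2001 <- int_start(y2001)
end_2001   <- int_end(y2001)
len_2001   <- int_length(y2001)            # length in seconds

overlap <- int_overlaps(y2001, mid_01)     # do the intervals share any time?
outside <- ymd("2002-01-15") %within% y2001
```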
Convert an interval to a period or duration with as.period
and as.duration
.
my_intvl <- interval(dmy("5 January 1961"), dmy("30 January 1969"))
as.period(my_intvl)
## [1] "8y 0m 25d 0H 0M 0S"
as.duration(my_intvl)
## [1] "254620800s (~8.07 years)"
Extract the timezone with tz(). Change the timezone while keeping the clock time with force_tz(dt, tzone=), or view the same instant in another timezone with with_tz(dt, tzone=). Get valid tzone names from OlsonNames().
game2 <- mdy_hm("June 11 2015 19:00")
game3 <- mdy_hm("June 15 2015 18:30")
# Set the timezone to "America/Edmonton"
game2_local <- force_tz(game2, tzone = "America/Edmonton")
game3_local <- force_tz(game3, tzone = "America/Winnipeg")
# What time is game2_local in NZ?
with_tz(game2_local, tzone = "Pacific/Auckland")
## [1] "2015-06-12 13:00:00 NZST"
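A quick way to see the difference between the two functions is to compare their results with the original datetime. In this sketch (an illustrative UTC timestamp), with_tz() leaves the instant unchanged, while force_tz() keeps the clock reading and therefore moves the instant by the zone's offset.

```r
library(lubridate)

x <- ymd_hms("2015-06-11 19:00:00", tz = "UTC")

viewed <- with_tz(x, tzone = "America/Edmonton")   # same instant, new display
forced <- force_tz(x, tzone = "America/Edmonton")  # same clock time, new instant

same_instant <- viewed == x
shift_hours  <- as.numeric(difftime(forced, x, units = "hours"))
```

Edmonton is six hours behind UTC in June, so the forced datetime is six hours later than the original instant.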
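stamp() is a convenient way to format a date. It returns a function that applies the format it infers from your example string. When the example is ambiguous, as in the output here, stamp() may pick an unintended format; its orders argument narrows the candidates.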
stamp("09/20/2017")(today())
## Multiple formats matched: "%Om/%d/%y%H"(1), "%Om/%y/%d%H"(1), "%Om/%d/%Y"(1), "%m/%d/%y%H"(1), "%m/%y/%d%H"(1), "%m/%d/%Y"(1)
## Using: "%Om/%y/%d%H"
## [1] "01/19/1400"
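One way to avoid the ambiguity is to tell stamp() which component order the example uses. The sketch below assumes month-day-year and a fixed date so the result is reproducible; base R's format() with an explicit format string is shown as an alternative that involves no guessing.

```r
library(lubridate)

# Restrict the guess to month-day-year and silence the messages.
us_date <- stamp("09/20/2017", orders = "mdy", quiet = TRUE)
stamped <- us_date(ymd("2019-01-14"))
stamped

# Explicit format string, no inference involved.
formatted <- format(ymd("2019-01-14"), "%m/%d/%Y")
formatted
```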
Package stringr
manipulates strings.
library(stringr)
# trim whitespace.
str_trim(" this is a test ")
## [1] "this is a test"
# pad string with zeros.
str_pad("2493", width = 7, side = "left", pad = "0")
## [1] "0002493"
# find pattern Alice
str_detect(c("Sarah", "Alice", "Tom"), "Alice")
## [1] FALSE TRUE FALSE
# replace pattern Alice with Jeff
str_replace(c("Sarah", "Alice", "Tom"), "Alice", "Jeff")
## [1] "Sarah" "Jeff" "Tom"
# Change case
toupper("DataCamp")
## [1] "DATACAMP"
tolower("DataCamp")
## [1] "datacamp"
Use is.na() to locate missing values (NA).
# 4x3 data frame with a few NAs.
df <- data.frame(A = c(1, NA, 8, NA),
B = c(3, NA, 88, 23),
C = c(2, 45, 3, 1),
D = c("A", "", "C", "D"))
# Any NAs?
any(is.na(df))
## [1] TRUE
# locate the NAs.
is.na(df)
## A B C D
## [1,] FALSE FALSE FALSE FALSE
## [2,] TRUE TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,] TRUE FALSE FALSE FALSE
# How many?
sum(is.na(df))
## [1] 3
# Summarize the NAs
summary(df)
## A B C D
## Min. :1.00 Min. : 3.0 Min. : 1.00 :1
## 1st Qu.:2.75 1st Qu.:13.0 1st Qu.: 1.75 A:1
## Median :4.50 Median :23.0 Median : 2.50 C:1
## Mean :4.50 Mean :38.0 Mean :12.75 D:1
## 3rd Qu.:6.25 3rd Qu.:55.5 3rd Qu.:13.50
## Max. :8.00 Max. :88.0 Max. :45.00
## NA's :2 NA's :1
# Rows with no missing values, two ways
df[complete.cases(df),]
## A B C D
## 1 1 3 2 A
## 3 8 88 3 C
na.omit(df)
## A B C D
## 1 1 3 2 A
## 3 8 88 3 C
# Replace empty strings with NA
df$D[df$D == ""] <- NA
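Logical indexing works the same way for filling in missing values. The sketch below rebuilds two of the columns and replaces the NAs in A with the column mean; mean imputation is only an assumption for illustration, not a recommendation from these notes.

```r
df_na <- data.frame(A = c(1, NA, 8, NA),
                    D = c("A", "", "C", "D"),
                    stringsAsFactors = FALSE)

# Turn empty strings into NA, touching only the matching elements.
df_na$D[df_na$D == ""] <- NA

# Replace missing numbers with the mean of the observed ones.
df_na$A[is.na(df_na$A)] <- mean(df_na$A, na.rm = TRUE)
df_na
```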
df2 <- data.frame(A = rnorm(100,50,10),
B = c(rnorm(99,50,10), 500),
C = c(rnorm(99,50,10), -1))
# Find outliers using hist() or boxplot().
hist(df2$B)
boxplot(df2)
# Drop or replace outliers. Use which() to find index of offending observation.
mymtcars <- mtcars
ind <- which(mymtcars$mpg == 15.0)
mymtcars$mpg[ind] <- 20.0
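Outliers can also be located programmatically rather than by eye. This sketch applies base R's boxplot.stats(), which flags points beyond 1.5 times the interquartile range from the box (the same rule boxplot() draws), to simulated data with one planted extreme value.

```r
set.seed(42)                          # reproducible simulated data
x <- c(rnorm(99, 50, 10), 500)        # 99 ordinary values plus one outlier

out_vals <- boxplot.stats(x)$out      # values flagged as outliers
out_idx  <- which(x %in% out_vals)    # their positions in x

x_clean <- x[-out_idx]                # drop the flagged observations
```

Whether to drop, cap, or keep flagged values depends on the analysis; the rule only identifies candidates.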
dplyr
The dplyr package provides data wrangling tools. It works with the tibble, a data frame constrained to display well in an R session. The tibble class inherits from the data frame class. Convert a data frame to a tibble with as_tibble(data.frame). glimpse(tbl) works with tibbles the way str(data.frame) works with data frames. Convert a tibble back to a data frame with as.data.frame(tbl).
library(dplyr)
# hflights is a data.frame of Houston based flights.
library(hflights)
## Warning: package 'hflights' was built under R version 3.4.4
hflights <- as_tibble(hflights)
head(hflights)
## # A tibble: 6 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## # ... with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## # AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## # Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
summary(hflights)
## Year Month DayofMonth DayOfWeek
## Min. :2011 Min. : 1.000 Min. : 1.00 Min. :1.000
## 1st Qu.:2011 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2.000
## Median :2011 Median : 7.000 Median :16.00 Median :4.000
## Mean :2011 Mean : 6.514 Mean :15.74 Mean :3.948
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:6.000
## Max. :2011 Max. :12.000 Max. :31.00 Max. :7.000
##
## DepTime ArrTime UniqueCarrier FlightNum
## Min. : 1 Min. : 1 Length:227496 Min. : 1
## 1st Qu.:1021 1st Qu.:1215 Class :character 1st Qu.: 855
## Median :1416 Median :1617 Mode :character Median :1696
## Mean :1396 Mean :1578 Mean :1962
## 3rd Qu.:1801 3rd Qu.:1953 3rd Qu.:2755
## Max. :2400 Max. :2400 Max. :7290
## NA's :2905 NA's :3066
## TailNum ActualElapsedTime AirTime ArrDelay
## Length:227496 Min. : 34.0 Min. : 11.0 Min. :-70.000
## Class :character 1st Qu.: 77.0 1st Qu.: 58.0 1st Qu.: -8.000
## Mode :character Median :128.0 Median :107.0 Median : 0.000
## Mean :129.3 Mean :108.1 Mean : 7.094
## 3rd Qu.:165.0 3rd Qu.:141.0 3rd Qu.: 11.000
## Max. :575.0 Max. :549.0 Max. :978.000
## NA's :3622 NA's :3622 NA's :3622
## DepDelay Origin Dest Distance
## Min. :-33.000 Length:227496 Length:227496 Min. : 79.0
## 1st Qu.: -3.000 Class :character Class :character 1st Qu.: 376.0
## Median : 0.000 Mode :character Mode :character Median : 809.0
## Mean : 9.445 Mean : 787.8
## 3rd Qu.: 9.000 3rd Qu.:1042.0
## Max. :981.000 Max. :3904.0
## NA's :2905
## TaxiIn TaxiOut Cancelled CancellationCode
## Min. : 1.000 Min. : 1.00 Min. :0.00000 Length:227496
## 1st Qu.: 4.000 1st Qu.: 10.00 1st Qu.:0.00000 Class :character
## Median : 5.000 Median : 14.00 Median :0.00000 Mode :character
## Mean : 6.099 Mean : 15.09 Mean :0.01307
## 3rd Qu.: 7.000 3rd Qu.: 18.00 3rd Qu.:0.00000
## Max. :165.000 Max. :163.00 Max. :1.00000
## NA's :3066 NA's :2947
## Diverted
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.000000
## Mean :0.002853
## 3rd Qu.:0.000000
## Max. :1.000000
##
# hflights consists of 227,496 observations and 21 variables.
nrow(hflights)
## [1] 227496
ncol(hflights)
## [1] 21
# Create a lookup table for the UniqueCarrier column using a named vector.
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
hflights$Carrier <- lut[hflights$UniqueCarrier]
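A small sketch of why the named-vector lookup works, using a made-up table (`lut_demo`) and made-up codes rather than the real hflights data. Indexing a named vector by a character vector returns one value per code, and a code with no entry comes back as NA, which is worth checking for.

```r
# Hypothetical lookup table and carrier codes, for illustration only.
lut_demo <- c("AA" = "American", "WN" = "Southwest")
codes <- c("AA", "WN", "ZZ", "AA")  # "ZZ" has no entry in lut_demo

# Indexing by name returns one value per code, in the order of `codes`.
# unname() drops the names that the result would otherwise carry along.
carrier <- unname(lut_demo[codes])
carrier

# Codes without an entry come back as NA, so count them before relying on the result.
n_missing <- sum(is.na(carrier))
n_missing
```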
dplyr features five verbs.
* select(.data, ...), where ... are variables. Use : to select a range of variables and - to exclude variables, similar to indexing a data.frame with square brackets. Use variable names or integer indexes. Helper functions include starts_with(), ends_with(), contains(), matches(), num_range(), and one_of().
* filter(.data, one or more comparisons). Among the operators are ==, !=, and %in%. Combine comparisons with & and |.
* arrange(.data, ...). The default sort order is ascending; wrap an argument in desc() to sort it in descending order.
* mutate(.data, name-value pairs of expressions).
* summarise(.data, ...). Base R includes several aggregate functions, and dplyr adds first(), last(), nth(), n(), and n_distinct().
Pipe a data set with %>% into a verb. The filter() verb returns a filtered data set. The arrange() verb returns a sorted data set; arrange in descending order with, for example, arrange(desc(gdpPerCap)). The mutate() verb adds or changes values in the data set. group_by(.data, col(s)) defines groups but does not change the data by itself; it takes effect when a later verb, typically summarize(), operates on each group. Specify group_by() before summarize().
dplyr uses %>% from the magrittr package.
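As a rough sketch, the verbs can be chained on the built-in mtcars data set. The new column names (kpl, mean_kpl, n_cars) and the weight cutoff are assumptions made up for this illustration.

```r
library(dplyr)

# mtcars ships with R; kpl, mean_kpl, and n_cars are made-up names for this sketch.
cars_summary <- mtcars %>%
  select(mpg, cyl, wt) %>%                 # keep three columns
  filter(wt < 4) %>%                       # keep the lighter cars
  mutate(kpl = mpg * 0.425) %>%            # approximate km per litre
  group_by(cyl) %>%                        # one group per cylinder count
  summarize(mean_kpl = mean(kpl), n_cars = n()) %>%
  arrange(desc(mean_kpl))                  # highest average first

cars_summary
```

Each verb takes a data set and returns a data set, which is what lets the pipe pass the result from one step to the next.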
library(dplyr)
library(hflights)
hflights <- as_tibble(hflights)
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
hflights$Carrier <- lut[hflights$UniqueCarrier]
# select example
select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancell"))
## # A tibble: 227,496 x 5
## UniqueCarrier FlightNum TailNum Cancelled CancellationCode
## * <chr> <int> <chr> <int> <chr>
## 1 AA 428 N576AA 0 ""
## 2 AA 428 N557AA 0 ""
## 3 AA 428 N541AA 0 ""
## 4 AA 428 N403AA 0 ""
## 5 AA 428 N492AA 0 ""
## 6 AA 428 N262AA 0 ""
## 7 AA 428 N493AA 0 ""
## 8 AA 428 N477AA 0 ""
## 9 AA 428 N476AA 0 ""
## 10 AA 428 N504AA 0 ""
## # ... with 227,486 more rows
# mutate example
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)
# filter example, chained with mutate(), group_by(), and summarize()
hflights %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = 60 * Distance / RealTime) %>%
  filter(!is.na(mph) & mph < 70) %>%
  group_by(UniqueCarrier) %>%
  summarize(n_less = n(), n_dest = n_distinct(Dest),
            min_dist = min(Distance), max_dist = max(Distance))
## # A tibble: 6 x 5
## UniqueCarrier n_less n_dest min_dist max_dist
## <chr> <int> <int> <dbl> <dbl>
## 1 AA 40 1 224. 224.
## 2 CO 3393 4 140. 305.
## 3 MQ 12 1 247. 247.
## 4 OO 349 3 140. 224.
## 5 WN 1747 4 148. 239.
## 6 XE 1185 12 79. 253.
dplyr works for data frames, data tables, and databases.
Use dplyr to merge data instead of base R merge() because dplyr syntax is intuitive, preserves row order, and works with databases.
The four mutating joins are left_join(tbl1, tbl2, by = c(col_names)), right_join(), inner_join(), and full_join().
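A minimal sketch of how the four mutating joins differ, using two hypothetical tables (`left_tbl`, `right_tbl`) that share a key column `id`. The tables and values are invented for illustration.

```r
library(dplyr)

# Two hypothetical tables that share the key column `id`.
left_tbl  <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
right_tbl <- tibble(id = c(2, 3, 4), y = c("B", "C", "D"))

lj <- left_join(left_tbl, right_tbl, by = "id")    # ids 1, 2, 3; y is NA for id 1
rj <- right_join(left_tbl, right_tbl, by = "id")   # ids 2, 3, 4; x is NA for id 4
ij <- inner_join(left_tbl, right_tbl, by = "id")   # ids 2 and 3 only
fj <- full_join(left_tbl, right_tbl, by = "id")    # ids 1, 2, 3, 4

fj
```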
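The filtering join semi_join() keeps the rows of the first table that have a match in the second table, like an inner join that returns no columns from the second table. The filtering join anti_join() keeps the rows of the first table that have no match in the second table, like a left join filtered to the rows where the second table's columns are null.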
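The filtering joins can be sketched with two hypothetical tables, `customers` and `orders` (invented for this example). Note that the order table lists one customer twice, which shows how semi_join() differs from inner_join().

```r
library(dplyr)

# Hypothetical tables; the order table lists customer 2 twice.
customers <- tibble(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders    <- tibble(id = c(2, 2, 3), amount = c(10, 20, 30))

with_orders    <- semi_join(customers, orders, by = "id")   # Bob and Cy, one row each, no `amount`
joined         <- inner_join(customers, orders, by = "id")  # Bob appears twice and `amount` is added
without_orders <- anti_join(customers, orders, by = "id")   # Ann, the only customer without an order

without_orders
```

Because a filtering join only keeps or drops rows of the first table, it never duplicates rows or adds columns.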
The set functions are union(), intersect(), and setdiff().
setequal(set1, set2) checks whether two data sets contain the same rows (not necessarily in the same order).
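A short sketch of the set functions applied to two hypothetical one-column data frames (`set1`, `set2`). With dplyr loaded, these functions compare whole rows.

```r
library(dplyr)

# Two hypothetical data frames with the same column.
set1 <- tibble(k = c(1, 2, 3))
set2 <- tibble(k = c(3, 2, 4))

either    <- union(set1, set2)       # rows found in either table, duplicates removed
both      <- intersect(set1, set2)   # rows found in both tables
only_set1 <- setdiff(set1, set2)     # rows in set1 that are not in set2

# The same rows in a different order still count as equal.
same_rows <- setequal(set1, tibble(k = c(3, 1, 2)))
same_rows
```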
If two data sets have identical structure, combine them with bind_rows() and bind_cols(), the dplyr equivalents of base R rbind() and cbind().
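A minimal sketch of the two binding functions, using hypothetical data frames (`q1`, `q2`, and a `region` column) invented for illustration.

```r
library(dplyr)

# Hypothetical data frames for illustration.
q1 <- tibble(id = 1:2, sales = c(10, 20))
q2 <- tibble(sales = c(30, 40), id = 3:4)   # same columns, different order

# bind_rows() matches columns by name, so the column order in q2 does not matter.
stacked <- bind_rows(q1, q2)
stacked

# bind_cols() pairs rows by position, so both inputs need the same number of rows.
widened <- bind_cols(q1, tibble(region = c("east", "west")))
widened
```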
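dplyr improves on the base R function data.frame() with data_frame(). data_frame() will not change data types, add row names, change column names, or recycle vectors longer than one element. In current releases these functions are named tibble() and as_tibble(); data_frame() and as_data_frame() are deprecated. The function as_data_frame() parallels the behavior of data_frame(). It combines a list of vectors into a data frame, making it the column-wise counterpart of bind_rows(), which combines a list of data frames.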
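One difference is easy to see with a column name that is not syntactically valid. This sketch uses tibble() and as_tibble(), the current names for data_frame() and as_data_frame(); the column name `flight num` is made up for the example.

```r
library(dplyr)

# data.frame() rewrites names that are not syntactically valid.
df_base <- data.frame(`flight num` = 1:2)
names(df_base)   # "flight.num"

# tibble() leaves the name untouched.
df_tbl <- tibble(`flight num` = 1:2)
names(df_tbl)    # "flight num"

# as_tibble() turns a list of equal-length vectors into columns.
from_list <- as_tibble(list(x = 1:2, y = c("a", "b")))
from_list
```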
library(Lahman)
## Warning: package 'Lahman' was built under R version 3.4.4
library(dplyr)
players <- Master %>%
distinct(playerID, nameFirst, nameLast)
players %>%
# Find unsalaried players
anti_join(Salaries, by = "playerID") %>%
# Join Batting to the unsalaried players
left_join(Batting, by = "playerID") %>%
# Group by player
group_by(playerID) %>%
# Sum at-bats for each player
summarise(total_at_bat = sum(AB, na.rm = TRUE)) %>%
# Arrange in descending order
arrange(desc(total_at_bat))
## # A tibble: 13,958 x 2
## playerID total_at_bat
## <chr> <int>
## 1 aaronha01 12364
## 2 yastrca01 11988
## 3 cobbty01 11434
## 4 musiast01 10972
## 5 mayswi01 10881
## 6 robinbr01 10654
## 7 wagneho01 10430
## 8 brocklo01 10332
## 9 ansonca01 10277
## 10 aparilu01 10230
## # ... with 13,948 more rows
library(Lahman)
library(dplyr)
# Find the distinct players that appear in HallOfFame
nominated <- HallOfFame %>%
distinct(playerID)
nominated %>%
# Count the number of players in nominated
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 1260
# 1,260 players were nominated for the hall of fame.
nominated_full <- nominated %>%
# Join to Master
left_join(Master, by = "playerID") %>%
# Return playerID, nameFirst, nameLast
select(playerID, nameFirst, nameLast)
# Find distinct players in HallOfFame with inducted == "Y"
inducted <- HallOfFame %>%
filter(inducted == "Y") %>%
distinct(playerID)
inducted %>%
# Count the number of players in inducted
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 317
# 317 players have been inducted.
inducted_full <- inducted %>%
# Join to Master
left_join(Master, by = "playerID") %>%
# Return playerID, nameFirst, nameLast
select(playerID, nameFirst, nameLast)
# Tally the number of awards in AwardsPlayers by playerID
nAwards <- AwardsPlayers %>%
group_by(playerID) %>%
tally()
nAwards %>%
# Filter to just the players in inducted
semi_join(inducted, by = "playerID") %>%
# Calculate the mean number of awards per player
summarize(avg_n = mean(n, na.rm = TRUE))
## # A tibble: 1 x 1
## avg_n
## <dbl>
## 1 12.1
nAwards %>%
# Filter to just the players in nominated
semi_join(nominated, by = "playerID") %>%
# Filter to players NOT in inducted
anti_join(inducted, by = "playerID") %>%
# Calculate the mean number of awards per player
summarize(avg_n = mean(n, na.rm = TRUE))
## # A tibble: 1 x 1
## avg_n
## <dbl>
## 1 4.23
# On average, inductees had roughly 12.1 - 4.23 = 7.9 more awards than non-inductees.
# Find the players who are in nominated, but not inducted
notInducted <- nominated %>%
setdiff(inducted)
Salaries %>%
# Find the players who are in notInducted
semi_join(notInducted, by = "playerID") %>%
# Calculate the max salary by player
group_by(playerID) %>%
summarize(max_salary = max(salary, na.rm = TRUE)) %>%
# Calculate the average of the max salaries
summarize(avg_salary = mean(max_salary, na.rm = TRUE))
## # A tibble: 1 x 1
## avg_salary
## <dbl>
## 1 5230273.
# Repeat for players who were inducted
Salaries %>%
semi_join(inducted, by = "playerID") %>%
group_by(playerID) %>%
summarize(max_salary = max(salary, na.rm = TRUE)) %>%
summarize(avg_salary = mean(max_salary, na.rm = TRUE))
## # A tibble: 1 x 1
## avg_salary
## <dbl>
## 1 6092038.
Appearances %>%
# Filter Appearances against nominated
semi_join(nominated, by = "playerID") %>%
# Find last year played by player
group_by(playerID) %>%
summarize(last_year = max(yearID)) %>%
# Join to full HallOfFame
left_join(HallOfFame, by = "playerID") %>%
# Filter for unusual observations
filter((yearID - last_year)<5)
## # A tibble: 194 x 10
## playerID last_year yearID votedBy ballots needed votes inducted
## <chr> <dbl> <int> <chr> <int> <int> <int> <fct>
## 1 altroni01 1933. 1937 BBWAA 201 151 3 N
## 2 applilu01 1950. 1953 BBWAA 264 198 2 N
## 3 bartedi01 1946. 1948 BBWAA 121 91 1 N
## 4 beckro01 2004. 2008 BBWAA 543 408 2 N
## 5 boudrlo01 1952. 1956 BBWAA 193 145 2 N
## 6 camildo01 1945. 1948 BBWAA 121 91 1 N
## 7 chandsp01 1947. 1950 BBWAA 168 126 2 N
## 8 chandsp01 1947. 1951 BBWAA 226 170 1 N
## 9 chapmbe01 1946. 1949 BBWAA 153 115 1 N
## 10 cissebi01 1938. 1937 BBWAA 201 151 1 N
## # ... with 184 more rows, and 2 more variables: category <fct>,
## # needed_note <chr>
Data visualization serves two purposes: exploratory analysis (investigating the data) and explanatory analysis (communicating the findings).
There are seven grammatical layers of plots; three are required: data, aesthetics, and geometries. The other elements are facets (subplots), statistics (e.g., fitted lines), coordinates, and themes. The grammar of graphics is implemented in the ggplot2 package.
Base R provides plotting functionality, but it comes with limitations. The plot is an image, not an object, so you cannot manipulate it further. It does not draw a legend automatically. There is a separate function for each plot type, and the lack of a unified framework means you have to learn each one separately: points(), hist(), etc.
Scale the x axis with a scale_x_log10() layer. There are two main reasons to use logarithmic scales in charts and graphs. The first is to respond to skewness towards large values; i.e., cases in which one or a few points are much larger than the bulk of the data. The second is to show percent change or multiplicative factors. On a log-scaled axis with base 2, the value of each tick mark is double the value of the preceding one. An example of a multiplicative factor is growth at a constant percentage rate.
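The doubling property can be checked with a few lines of base R. The values below are hypothetical and were chosen so that each one is twice the previous one.

```r
# Hypothetical values that double at every step.
values <- c(1, 2, 4, 8, 16, 32)

# On the raw scale the gaps between neighbours keep growing...
raw_gaps <- diff(values)
raw_gaps   # 1 2 4 8 16

# ...but on a base-2 log scale each doubling is exactly one unit apart.
log_gaps <- diff(log2(values))
log_gaps   # 1 1 1 1 1
```

Equal steps on the log scale correspond to equal ratios in the data, which is why multiplicative change reads as a straight line.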
For scatterplots, map x, y, color, and shape in the aesthetic layer. Map size, fill, shape, alpha (transparency), and position (e.g., “jitter”) in the geom_point layer.
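A minimal sketch of the split between the two layers, using the built-in mtcars data. The specific size and alpha values are arbitrary choices for illustration.

```r
library(ggplot2)

# Aesthetics inside aes() are mapped to columns and vary with the data.
# Arguments given directly to geom_point() apply the same value to every point.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, alpha = 0.6, position = "jitter")

p
```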
mtcars$cyl <- as.factor(mtcars$cyl)
# Use base r to create plots with a series for each cyl value.
# Add a linear fit line through the points, one for each series, and one overall.
plot(mtcars$wt, mtcars$mpg, col = factor(mtcars$cyl))
abline(lm(mpg ~ wt, data = mtcars), lty = 2)
# Draw one fit line per cyl level (not one per row of mtcars).
# The color index i matches the integer code of the factor level used by plot().
cyl_levels <- levels(mtcars$cyl)
for (i in seq_along(cyl_levels)) {
abline(lm(mpg ~ wt, data = mtcars, subset = (cyl == cyl_levels[i])), col = i)
}
legend(x = 5, y = 33, legend = levels(mtcars$cyl),
col = 1:3, pch = 1, bty = "n")
# Again in ggplot2
# The first geom_smooth inherits the ggplot color aesthetic as its group.
# The second geom_smooth explicitly sets group to a dummy 1. The col = "All" adds it to the legend.
# When mapping onto color you can sometimes treat a continuous scale, like year, as an
# ordinal variable, but only if it is a regular series. The better alternative is to leave
# it as a continuous variable and use the group aesthetic as a factor to make sure your
# plot is drawn correctly.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl, group = factor(cyl))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(method = "lm", se = FALSE, linetype = 2, aes(group = 1, col = "All"))
ggplot can visualize four attributes at once with x, y, col, and facet_grid. Such graphing requires tidy data, which in turn requires thoughtful definitions of metrics. In the iris data set, if measuring length vs width, then those are separate variables (columns). If measuring length (or width) vs species, then species is a variable. If measuring length (or width) vs part of flower (petal vs sepal), then flower part is a variable. To look at all four together, length and width become values of a single measure variable (they share units).
library(ggplot2)
library(tidyr)
iris.tidy <- iris %>%
  # gather(data, key, value, <cols>)
  # Transpose all cols to rows except the identifier col (Species).
  # The former column name becomes a value in the key column.
  gather(key, Value, -Species) %>%
  # separate(data, col, into, sep)
  separate(col = key, into = c("Part", "Measure"), sep = "\\.")
# If we want to plot Length vs Width, then each should be a column.
iris$Flower <- 1:nrow(iris)
iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  spread(Measure, value)
ggplot(iris.wide, aes(x = Length, y = Width, color = Part)) +
geom_jitter() +
facet_grid(. ~ Species)
Typical aesthetics are x, y, colour, fill, size, alpha, linetype, label, and shape. Shapes 0 to 20 accept only the color aesthetic, and shapes 21 to 25 accept both color and fill.
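A small sketch of a shape that takes both aesthetics, using the built-in mtcars data. The black outline, size, and alpha values are arbitrary choices for illustration.

```r
library(ggplot2)

# Shape 21 is a circle with a separate border and interior:
# `fill` colors the inside and `color` colors the outline.
p_filled <- ggplot(mtcars, aes(x = wt, y = mpg, fill = factor(cyl))) +
  geom_point(shape = 21, color = "black", size = 4, alpha = 0.6)

p_filled
```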
One common technique to use with solid shapes is alpha blending (i.e. adding transparency). An alternative is to use hollow shapes.
library(ggplot2)
# Basic scatter plot: wt on x-axis and mpg on y-axis; map cyl to col
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
geom_point(size = 4)
# Hollow circles - an improvement
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
geom_point(size = 4, shape = 1)