Load libraries with either library() or require().
String construction.
dog <- "Chester"
print(paste("you are a dog", dog))
## [1] "you are a dog Chester"
nchar(dog)
## [1] 7
Create a vector with the combine function c(). Reference vector elements with brackets, or with element names. R compares vectors element-wise. If you compare a vector to a single value, R recycles the value to a vector of the appropriate length.
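For example, a minimal sketch of element-wise comparison and recycling (the vectors are illustrative):
x <- c(1, 5, 10)
y <- c(2, 4, 10)
x > y   # element-wise comparison
## [1] FALSE  TRUE FALSE
x > 4   # 4 is recycled to length 3
## [1] FALSE  TRUE  TRUE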
There are two types of vectors in R: atomic vectors, and lists. Atomic vectors are homogeneous, of one of six types: logical, integer, double, character, complex, and raw (don't worry about the relatively uncommon complex and raw types). Lists are recursive vectors (they can contain other lists).
Vectors have two key properties: type, returned by typeof(), and length, returned by length(). Subset a list with single brackets and extract elements with double brackets. For example,
a <- list(
a = 1:3,
b = "a string",
c = pi,
d = list(-1, -5)
)
# Single brackets return a sub-list containing d.
typeof(a[4])
## [1] "list"
# Double brackets extract list d itself.
typeof(a[[4]])
## [1] "list"
# The first element of list d.
typeof(a[[4]][1])
## [1] "list"
# The first value of list d
typeof(a[[4]][[1]])
## [1] "double"
numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")
boolean_vector <- c(TRUE, FALSE, TRUE)
character_vector[1]
## [1] "a"
boolean_vector[c(2,3)]
## [1] FALSE TRUE
boolean_vector[2:3]
## [1] FALSE TRUE
roulette_vector <- c(-24, -50, 100, -350, 10)
names(roulette_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
roulette_vector[1]
## Monday
## -24
roulette_vector["Monday"]
## Monday
## -24
# vector operations
sum(roulette_vector)
## [1] -314
mean(roulette_vector)
## [1] -62.8
# take a subset of a vector using booleans
roulette_vector[roulette_vector>0]
## Wednesday Friday
## 100 10
A matrix is a two-dimensional collection of elements. Create a matrix with the matrix(data, nrow, ncol, byrow) function. Label the rows with rownames() and the columns with colnames(). Sum each row and column into vectors with rowSums() and colSums(). Bind rows and columns to a matrix with rbind() and cbind(). Reference matrix items with brackets [row, col].
# Matrix of numbers 1:20, filling one row at a time, for 5 rows and 4 columns. Specifying the number of columns is optional if number of rows is specified.
m <- matrix(1:20, byrow = TRUE, nrow = 5, ncol = 4)
rownames(m) <- c("row 1", "row 2", "row 3", "row 4", "row 5")
colnames(m) <- c("Col 1", "col 2", "col 3", "col 4")
m
## Col 1 col 2 col 3 col 4
## row 1 1 2 3 4
## row 2 5 6 7 8
## row 3 9 10 11 12
## row 4 13 14 15 16
## row 5 17 18 19 20
# Bind row sums to matrix.
m.rowSum <- rowSums(m)
cbind(m, m.rowSum)
## Col 1 col 2 col 3 col 4 m.rowSum
## row 1 1 2 3 4 10
## row 2 5 6 7 8 26
## row 3 9 10 11 12 42
## row 4 13 14 15 16 58
## row 5 17 18 19 20 74
# All rows of the second column of m.
m[,2]
## row 1 row 2 row 3 row 4 row 5
## 2 6 10 14 18
Use nrow() and ncol() to determine the number of rows and columns.
for (i in 1:nrow(m)) {
for (j in 1:ncol(m)) {
print(paste("On row ", i, " and column ", j, " the matrix contains ", m[i,j]))
}
}
## [1] "On row 1 and column 1 the matrix contains 1"
## [1] "On row 1 and column 2 the matrix contains 2"
## [1] "On row 1 and column 3 the matrix contains 3"
## [1] "On row 1 and column 4 the matrix contains 4"
## [1] "On row 2 and column 1 the matrix contains 5"
## [1] "On row 2 and column 2 the matrix contains 6"
## [1] "On row 2 and column 3 the matrix contains 7"
## [1] "On row 2 and column 4 the matrix contains 8"
## [1] "On row 3 and column 1 the matrix contains 9"
## [1] "On row 3 and column 2 the matrix contains 10"
## [1] "On row 3 and column 3 the matrix contains 11"
## [1] "On row 3 and column 4 the matrix contains 12"
## [1] "On row 4 and column 1 the matrix contains 13"
## [1] "On row 4 and column 2 the matrix contains 14"
## [1] "On row 4 and column 3 the matrix contains 15"
## [1] "On row 4 and column 4 the matrix contains 16"
## [1] "On row 5 and column 1 the matrix contains 17"
## [1] "On row 5 and column 2 the matrix contains 18"
## [1] "On row 5 and column 3 the matrix contains 19"
## [1] "On row 5 and column 4 the matrix contains 20"
The factor() function converts a variable into type factor. R needs to know whether a variable is continuous or categorical. To specify an ordinal categorical variable, set ordered = TRUE and supply levels.
student_status <- c("student", "not student", "student", "not student")
categorical_student <- factor(student_status)
categorical_student
## [1] student not student student not student
## Levels: not student student
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
temperature_vector
## [1] "High" "Low" "High" "Low" "Medium"
# Character vectors compare alphabetically; ordered factors compare by level.
temperature_vector[1] > temperature_vector[2]
## [1] FALSE
factor_temperature_vector[1] > factor_temperature_vector[2]
## [1] TRUE
# Change the level names with the levels function. Note the levels are initially in alphabetical order.
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")
# Notice how summary treats a factor variable differently from a regular variable.
summary(survey_vector)
## Length Class Mode
## 5 character character
summary(factor_survey_vector)
## Female Male
## 2 3
A dataframe is like a matrix, except each column can be a different data type. Several functions inspect data frames:
* head (tail): by default prints the first (last) 6 rows of the dataframe.
* str: prints the structure of the dataframe. Probably the first function you'll call with a new data set.
* dim: prints the dimensions of the dataframe.
* colnames: prints the column names of the dataframe.
* na.omit(): removes rows with NA in any column.
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
head(mtcars,6)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
colnames(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
Create a data frame with the data.frame() function.
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets_df <- data.frame(planets, type, diameter, rotation, rings)
# Select first 5 values of diameter column. The $ is a short-cut method.
planets_df[1:5,"diameter"]
## [1] 0.382 0.949 1.000 0.532 11.209
planets_df$diameter[1:5]
## [1] 0.382 0.949 1.000 0.532 11.209
Use subset() to apply a where condition to the data frame rows. Use order() to apply an order by to the data frame.
subset(planets_df, subset = diameter < 1)
## planets type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
planets_df[order(planets_df$diameter),]
## planets type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
Construct a list of objects with list(). Name the list items either with "=" at creation, or using names().
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]
my_list <- list(my_vector, my_matrix, my_df)
names(my_list) <- c("vec", "mat", "df")
my_list
## $vec
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $mat
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## $df
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# or
my_list2 <- list(vec = my_vector, mat = my_matrix, df = my_df)
my_list2
## $vec
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $mat
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## $df
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Reference items in a list by component number in double brackets, by name in double brackets, or by name after a dollar sign.
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
# Third col of second element of my_list (my_matrix)
my_list[[2]][,3]
## [1] 7 8 9
my_list$mat[,3]
## [1] 7 8 9
Append to a list with combine, c().
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
my_list <- c(my_list, df2 = my_df)
Relational operators are ==, !=, <, <=, >, and >=. Logical operators are &, |, and !. Be careful not to use && or ||: they evaluate only the first element of each vector! Control constructs include if(), else if(), and else.
x <- 3
if (x %% 2 == 0) {
print("x is divisible by 2")
} else if (x %% 3 == 0) {
print("x is divisible by 3")
} else {
print("x is divisible by neither 2 nor 3")
}
## [1] "x is divisible by 3"
The while loop is while (condition) { }. Break out of a loop early with if (condition) { break }.
i <- 1
while (i <= 10) {
print(3 * i)
if (3 * i %% 8 == 0) {
break()
}
i <- i + 1
}
## [1] 3
## [1] 6
## [1] 9
## [1] 12
## [1] 15
## [1] 18
## [1] 21
## [1] 24
The for loop is for (var in seq) { expr }. The break statement abandons the active loop. The next statement skips the rest of the statements in the current loop iteration.
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
# Loop version 1
for(views in linkedin) {
print(views)
if (views > 10) {
break
} else if (views < 5) {
next
}
}
## [1] 16
# Loop version 2
for(i in 1:length(linkedin)) {
print(linkedin[i])
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14
# seq_along handles zero-length vectors and lists.
for (i in seq_along(linkedin)) {
print(linkedin[i])
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14
Get help on a function with help() or ?, or view its arguments with args(). Specify function parameters either by name or by position. When the documentation specifies default values, they are not required.
#help(mean)
#?mean
args(mean)
## function (x, ...)
## NULL
grades <- c(8.5, 7, 9, 5.5, 6)
mean(x=grades)
## [1] 7.2
mean(grades)
## [1] 7.2
Define a custom function with function(). The return statement returns and exits immediately and is optional. Set a default argument value with =.
multiply_a_b <- function(a, b = 1) {
return (a * b)
}
result <- multiply_a_b(a = 3, b = 7)
Install a package with install.packages(). Packages are located at the Comprehensive R Archive Network (CRAN). View the currently attached packages with search(). R attaches seven packages to its search list by default. Attach more packages with library() or require().
Function lapply(X, FUN, ...) applies a function to a list. lapply() returns a list, so if X is a vector, cast the result back to a vector with unlist(). If the function requires arguments, pass them in as additional arguments to lapply(). Functions can be named or anonymous, so if used only once, define the function within lapply().
lapply(list(1,2,3), function(x) { 3 * x })
## [[1]]
## [1] 3
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 9
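Extra arguments are passed through to each call; for example:
lapply(list(1, 2, 3), rep, times = 2)
## [[1]]
## [1] 1 1
##
## [[2]]
## [1] 2 2
##
## [[3]]
## [1] 3 3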
Function sapply() calls lapply() then converts the list to a one-dimensional array (vector) or two-dimensional array (matrix). If sapply() cannot simplify because the resulting list contains vectors of varying lengths, then sapply() returns the same result as lapply().
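A minimal sketch of both cases:
sapply(list(1, 2, 3), function(x) { 3 * x })  # simplifies to a vector
## [1] 3 6 9
sapply(list(1:2, 1:3), seq_along)  # varying lengths: falls back to a list
## [[1]]
## [1] 1 2
##
## [[2]]
## [1] 1 2 3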
Function vapply() uses lapply() but with FUN.VALUE indicating the return variable type. vapply() is a safe alternative to sapply().
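For example:
vapply(list(1, 2, 3), function(x) { 3 * x }, FUN.VALUE = numeric(1))
## [1] 3 6 9
# vapply(list(1, 2, 3), as.character, FUN.VALUE = numeric(1)) raises an error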
purrr Package
The purrr package maps functions over a vector and returns a vector. map() returns a list; the typed variants are map_dbl(), map_lgl(), map_int(), and map_chr(). The purrr functions provide shortcuts for the .f argument, are more consistent than lapply and sapply, and handle iteration well.
library(purrr)
## Warning: package 'purrr' was built under R version 3.4.4
cyl <- split(mtcars, mtcars$cyl)
# Regress mpg ~ wt on each cylinder class
map(cyl, function(df) lm(mpg ~ wt, data = df))
## $`4`
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Coefficients:
## (Intercept) wt
## 39.571 -5.647
##
##
## $`6`
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Coefficients:
## (Intercept) wt
## 28.41 -2.78
##
##
## $`8`
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Coefficients:
## (Intercept) wt
## 23.868 -2.192
# Same thing with shortcuts
models <- map(cyl, ~ lm(mpg ~ wt, data = .))
coefs <- map(models, coef)
map(coefs, "wt")
## $`4`
## [1] -5.647025
##
## $`6`
## [1] -2.780106
##
## $`8`
## [1] -2.192438
# Or, using a single command with pipes.
mtcars %>%
split(mtcars$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(coef) %>%
map_dbl("wt")
## 4 6 8
## -5.647025 -2.780106 -2.192438
The safely() function wraps a function so that each call returns a list with two elements: result and error. possibly() returns a default value on errors. quietly() captures printed output, messages, and warnings instead of errors.
safe_readLines <- safely(readLines)
# Call safe_readLines() on "http://example.org"
example_lines <- safe_readLines("http://example.org")
example_lines
## $result
## NULL
##
## $error
## NULL
# Call safe_readLines() on "http://asdfasdasdkfjlda"
nonsense_lines <- safe_readLines("http://asdfasdasdkfjlda")
nonsense_lines
## $result
## NULL
##
## $error
## NULL
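A minimal sketch of possibly(), which substitutes a default value instead of capturing the error (the default NA_real_ is illustrative):
safe_log <- possibly(log, otherwise = NA_real_)
safe_log(10)
## [1] 2.302585
safe_log("not a number")
## [1] NA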
n <- list(5, 10, 20)
mu <- list(1, 5, 10)
sd <- list(0.1, 1, 0.1)
# iterate over the lists
pmap(list(n, mu, sd), rnorm)
## [[1]]
## [1] 1.0380868 0.9605489 1.0786154 1.0073599 1.0234126
##
## [[2]]
## [1] 4.343431 6.307386 3.939620 3.125216 7.622740 5.457172 5.548574
## [8] 4.371869 4.627905 5.260454
##
## [[3]]
## [1] 10.053020 10.053259 10.119406 9.824395 9.995872 9.749677 9.997900
## [8] 10.128129 10.115909 10.197187 10.031033 10.080599 9.935449 10.055783
## [15] 10.083899 9.935934 9.781156 10.215975 10.060304 10.016733
funs <- list("rnorm", "runif", "rexp")
rnorm_params <- list(mean = 10)
runif_params <- list(min = 0, max = 5)
rexp_params <- list(rate = 5)
params <- list(
rnorm_params,
runif_params,
rexp_params
)
# Call invoke_map() on funs supplying params and setting n to 5
invoke_map(funs, params, n = 5)
## [[1]]
## [1] 9.657600 12.019679 10.136912 11.521788 9.658688
##
## [[2]]
## [1] 1.0613833 2.0008371 1.4973380 2.9227932 0.3804437
##
## [[3]]
## [1] 0.07188987 0.07739475 0.03476835 0.33302093 0.17282787
walk() operates just like map() except it's designed for functions that don't return anything. Use walk() for functions with side effects like printing, plotting, or saving.
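For example:
walk(list(1, 2, 3), print)
## [1] 1
## [1] 2
## [1] 3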
#?walk2
stopifnot() is a quick way to stop a function if a condition fails. stopifnot() takes logical expressions as arguments and raises an error if any is FALSE.
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)
both_na <- function(x, y) {
stopifnot(length(x) == length(y))
sum(is.na(x) & is.na(y))
}
#both_na(x, y)
Use stop() instead of stopifnot() to specify a more informative error message.
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)
both_na <- function(x, y) {
if (length(x) != length(y)) {
stop("x and y must have the same length", call. = FALSE)
}
sum(is.na(x) & is.na(y))
}
#both_na(x, y)
R features a bunch of functions to juggle around with data structures:
* seq(from = 1, to = 2, by = 0.25): generates a sequence from 1 to 2 incremented by 0.25.
* rep(x, times): replicates elements of vectors and lists.
* sort(x): sorts a vector.
* rev(x): reverses the elements in a data structure for which reversal is defined.
* str(x): displays the structure of any R object x.
* append(x, y): appends vector or list y to x.
* is.*(): checks the class of R object x.
* as.*(): casts R object x.
* unlist(x): flattens (possibly embedded) lists to produce a vector.
myseq <- seq(8, 2, by=-2)
myseq
## [1] 8 6 4 2
myrep <- rep(myseq, times =2)
myrep
## [1] 8 6 4 2 8 6 4 2
myrep <- rep(myseq, each = 2)
myrep
## [1] 8 8 6 6 4 4 2 2
linkedin <- list(16, 9, 13, 5, 2, 17, 14)
facebook <- list(17, 7, 5, 16, 8, 13, 14)
li_vec <- unlist(linkedin)
fb_vec <- unlist(facebook)
social_vec <- append(li_vec, fb_vec)
sort(social_vec, decreasing = TRUE)
## [1] 17 17 16 16 14 14 13 13 9 8 7 5 5 2
R supports regular expressions with several functions. grepl(pattern = "a", x = animals) returns TRUE for each element of x matching the pattern. Regular expression "^a" means starts with a; "a$" means ends with a; ".*" means any character zero or more times; "\\s" means whitespace; "[0-9]+" means one or more digits.
grep(pattern = "a", x = animals) returns the vector indices of the elements of x matching the pattern.
sub(pattern = "a", replacement = "o", x = animals) substitutes the first "a" with "o" in each element.
gsub(pattern = "a", replacement = "o", x = animals) substitutes all "a"s with "o"s.
animals <- c("cat", "moose", "impala", "ant", "kiwi")
grepl(pattern = "a", x = animals)
## [1] TRUE FALSE TRUE TRUE FALSE
which(grepl(pattern = "a", x = animals))
## [1] 1 3 4
grep(pattern = "a", x = animals)
## [1] 1 3 4
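And the substitution functions:
sub(pattern = "a", replacement = "o", x = animals)
## [1] "cot"    "moose"  "impola" "ont"    "kiwi"
gsub(pattern = "a", replacement = "o", x = animals)
## [1] "cot"    "moose"  "impolo" "ont"    "kiwi"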
There are two datetime classes in R: POSIXlt, a list with named components, and POSIXct, the number of seconds since 1970-01-01 00:00:00. POSIXct is more amenable to data frames, so you will encounter it much more often. Sys.Date() returns a Date equal to today. Sys.time() returns a POSIXct.
as.Date("2018-10-16")
## [1] "2018-10-16"
as.POSIXct("2018-11-28 08:34:00")
## [1] "2018-11-28 08:34:00 EST"
The simplest file to import is RData.
url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"
download.file(url_rdata, "Programs/Data/wine_local.RData")
# loading wine_local.RData creates variable wine.
load("Programs/Data/wine_local.RData")
summary(wine)
## Alcohol Malic acid Ash Alcalinity of ash
## Min. :11.03 Min. :0.74 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.60 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.87 Median :2.360 Median :19.50
## Mean :12.99 Mean :2.34 Mean :2.366 Mean :19.52
## 3rd Qu.:13.67 3rd Qu.:3.10 3rd Qu.:2.560 3rd Qu.:21.50
## Max. :14.83 Max. :5.80 Max. :3.230 Max. :30.00
## Magnesium Total phenols Flavanoids Nonflavanoid phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.740 1st Qu.:1.200 1st Qu.:0.2700
## Median : 98.00 Median :2.350 Median :2.130 Median :0.3400
## Mean : 99.59 Mean :2.292 Mean :2.023 Mean :0.3623
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.860 3rd Qu.:0.4400
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color intensity Hue Proline
## Min. :0.410 Min. : 1.280 Min. :1.270 Min. : 278.0
## 1st Qu.:1.250 1st Qu.: 3.210 1st Qu.:1.930 1st Qu.: 500.0
## Median :1.550 Median : 4.680 Median :2.780 Median : 672.0
## Mean :1.587 Mean : 5.055 Mean :2.604 Mean : 745.1
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :3.580 Max. :13.000 Max. :4.000 Max. :1680.0
# or, equivalently,
load(url("https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"))
summary(wine)
## Alcohol Malic acid Ash Alcalinity of ash
## Min. :11.03 Min. :0.74 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.60 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.87 Median :2.360 Median :19.50
## Mean :12.99 Mean :2.34 Mean :2.366 Mean :19.52
## 3rd Qu.:13.67 3rd Qu.:3.10 3rd Qu.:2.560 3rd Qu.:21.50
## Max. :14.83 Max. :5.80 Max. :3.230 Max. :30.00
## Magnesium Total phenols Flavanoids Nonflavanoid phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.740 1st Qu.:1.200 1st Qu.:0.2700
## Median : 98.00 Median :2.350 Median :2.130 Median :0.3400
## Mean : 99.59 Mean :2.292 Mean :2.023 Mean :0.3623
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.860 3rd Qu.:0.4400
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color intensity Hue Proline
## Min. :0.410 Min. : 1.280 Min. :1.270 Min. : 278.0
## 1st Qu.:1.250 1st Qu.: 3.210 1st Qu.:1.930 1st Qu.: 500.0
## Median :1.550 Median : 4.680 Median :2.780 Median : 672.0
## Mean :1.587 Mean : 5.055 Mean :2.604 Mean : 745.1
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :3.580 Max. :13.000 Max. :4.000 Max. :1680.0
There are three common packages designed to load flat files: utils, which comes with base R; readr; and data.table.
utils
The base R utils package includes flat file reading functions. read.table() is a generic flat file loading function. Wrapper function read.csv() reads comma-separated files, and read.delim() reads tab-delimited files.
* stringsAsFactors = TRUE treats string variables as categorical.
* col.names = c() overrides, or sets, column names.
* colClasses = c() sets data types. NULL elements in the vector drop the variable.
# Opt 1: set working dir to file location
# setwd("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Data")
# Opt 2: define a file path relative to script file.
path <- file.path("Data", "swimming_pools.csv")
swimming_pools <- read.csv(path, stringsAsFactors = FALSE)
swimming_pools <- read.table(path,
sep = ",",
header = TRUE,
col.names = c("name", "address", "ph", "ph2", "open_hr","facilities", "disabl","park","lat","longit"),
colClasses = c("factor", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "numeric", "numeric"))
readr
readr is similar to utils, but is faster and less verbose. readr returns a "tibble" instead of a data frame. Functions read_csv() and read_tsv() are wrappers for read_delim(), similar to the construction in package utils.
* col_names = TRUE sets column names to the first row of data. Set col_names = FALSE for system-generated names or set col_names = c() to set the column names to a character vector.
* col_types = c() sets data types. NULL elements in the vector drop the variable. Use shorthand strings where col_types = "cd_il" means "character, double, (skip), integer, logical".
* col_factor() and col_integer() also set column types.
library(readr)
pools <- file.path("Programs/Data", "swimming_pools.csv")
# or, if on the web,
pools.path <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
pools <- read_csv(pools.path)
## Parsed with column specification:
## cols(
## Name = col_character(),
## Address = col_character(),
## Latitude = col_double(),
## Longitude = col_double()
## )
potatoes.path <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"
potatoes <- read_delim(potatoes.path, delim = "\t")
## Parsed with column specification:
## cols(
## area = col_integer(),
## temp = col_integer(),
## size = col_integer(),
## storage = col_integer(),
## method = col_integer(),
## texture = col_double(),
## flavor = col_double(),
## moistness = col_double()
## )
machine <- file.path("Programs/Data", "machine.txt")
properties <- c("new", "old")
machine.fragment <- read_tsv(machine, skip = 6, n_max = 5,
col_names = properties)
## Parsed with column specification:
## cols(
## new = col_double(),
## old = col_double()
## )
hotdogs <- file.path("Programs/Data", "hotdogs.txt")
hotdogs_factor <- read_tsv(hotdogs,
col_names = c("type", "calories", "sodium"),
skip = 1)
## Parsed with column specification:
## cols(
## type = col_character(),
## calories = col_double(),
## sodium = col_double()
## )
data.table
The data.table package is optimized for large files. fread() is faster and more convenient than read.table().
library(data.table)
## Warning: package 'data.table' was built under R version 3.4.4
##
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
##
## transpose
pools <- file.path("Programs/Data", "swimming_pools.csv")
machine <- file.path("Programs/Data", "machine.txt")
properties <- c("new", "old")
machine.fragment <- fread(machine)
There are three packages to choose from: readxl, gdata, and XLConnect. gdata only handles .xls files and will be replaced when readxl is more mature. XLConnect is designed to work with Excel through R.
readxl
readxl cannot read directly from the internet. First download the file, then import it.
Package readxl functions: excel_sheets() lists the available sheets; read_excel() reads the file.
* col_names = TRUE sets column names to the first row of data. Set col_names = FALSE for system-generated names or set col_names = c() to set the column names to a character vector.
* col_types = c() sets data types. "blank" elements in the vector drop the variable.
* skip skips lines. If the first line is column names, you will have to set them manually.
library(readxl)
## Warning: package 'readxl' was built under R version 3.4.4
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
download.file(url_xls, file.path("Programs/Data", "local_latitude.xls"))
#excel_readxl <- read_excel(file.path("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Programs/Data", "local_latitude.xls"))
mini.path <- file.path("Programs/Data", "MinitabIntroData.xlsx")
excel_sheets(mini.path)
## [1] "Sheet1" "Sheet2"
sheet1 <- read_excel(mini.path, sheet = "Sheet1")
sheet2 <- read_excel(mini.path, sheet = "Sheet2")
sheet.list = list(sheet1, sheet2)
# Equivalently...
sheet.list <- lapply(excel_sheets(mini.path),
read_excel, path = mini.path)
gdata
gdata requires Perl in the background. It can only read .xls files. It can read directly from web sites, though.
library(gdata)
## Warning: package 'gdata' was built under R version 3.4.4
## gdata: Unable to locate valid perl interpreter
## gdata:
## gdata: read.xls() will be unable to read Excel XLS and XLSX files
## gdata: unless the 'perl=' argument is used to specify the location
## gdata: of a valid perl intrpreter.
## gdata:
## gdata: (To avoid display of this message in the future, please
## gdata: ensure perl is installed and available on the executable
## gdata: search path.)
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLX' (Excel 97-2004) files.
##
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLSX' (Excel 2007+) files.
##
## gdata: Run the function 'installXLSXsupport()'
## gdata: to automatically download and install the perl
## gdata: libaries needed to support Excel XLS and XLSX formats.
##
## Attaching package: 'gdata'
## The following objects are masked from 'package:data.table':
##
## first, last
## The following object is masked from 'package:purrr':
##
## keep
## The following object is masked from 'package:stats':
##
## nobs
## The following object is masked from 'package:utils':
##
## object.size
## The following object is masked from 'package:base':
##
## startsWith
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
#read.xls(url_xls)
#library(XLConnect)
mini.path <- file.path("Programs/Data", "MinitabIntroData.xlsx")
#my_book <- loadWorkbook(mini.path)
#class(my_book)
#getSheets(my_book)
#readWorksheet(my_book, sheet = 2)
#all <- lapply(sheets, readWorksheet, object = my_book)
#str(all)
#createSheet(my_book, name = "year_2010")
#writeWorksheet(my_book, pop_2010, sheet = "year_2010")
#saveWorkbook(my_book, file = "MinitabIntroData2.xlsx")
There is a dedicated package for each DBMS: RMySQL, RPostgreSQL, ROracle, etc. Function dbGetQuery() is a convenient aggregator of three functions: dbSendQuery(), dbFetch(), and dbClearResult(). Use the three functions if the data set is large and only a chunk of data is needed at a time (see the sketch after the example below).
library(DBI)
## Warning: package 'DBI' was built under R version 3.4.4
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
con
## <MySQLConnection:0,0>
# read all tables into a list of data frames
table_names <- dbListTables(con)
tables <- lapply(table_names, dbReadTable, conn = con)
# read an entire table, then subset the rows you want (inefficient)
comments <- dbReadTable(con, "comments")
subset(comments, subset = user_id == 1)
## id tweat_id user_id message
## 4 1012 87 1 awesome! thanks!
## 7 1004 49 1 this is fabulous!
## 11 1020 77 1 couldn't be better
## 12 1014 77 1 saved my day
elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments
WHERE user_id = 1")
latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > \"2015-09-21\"")
dbDisconnect(con)
## [1] TRUE
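A minimal sketch of the chunked, three-function approach mentioned above (assumes the connection con is still open; the query and chunk size are illustrative):
res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id = 1")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 2)  # fetch two rows at a time
  print(chunk)
}
dbClearResult(res)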
If a file resides on the web, reference it directly instead of manually downloading. For the readxl package, you will have to first download the file.
url = "http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r"
dest_path = file.path("~", "local_cities.xlsx")
#download.file(url, dest_path)
The httr package also handles internet files.
library(httr)
## Warning: package 'httr' was built under R version 3.4.4
resp <- GET("http://www.example.com/")
raw_content <- content(resp, as = "raw")
head(raw_content)
## [1] 3c 21 64 6f 63 74
JSON files are either name-value pair objects {"id":1,"name":"Frank"}, or arrays [1,2,3,"dog"].
library(jsonlite)
## Warning: package 'jsonlite' was built under R version 3.4.4
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'
# Convert file JSON into list
wine <- fromJSON(wine_json)
str(wine)
## List of 5
## $ name : chr "Chateau Migraine"
## $ year : int 1997
## $ alcohol_pct: num 12.4
## $ color : chr "red"
## $ awarded : logi FALSE
# Convert web API JSON into list
url_sw4 <- "http://www.omdbapi.com/?apikey=ff21610b&i=tt0076759&r=json"
url_sw3 <- "http://www.omdbapi.com/?apikey=ff21610b&i=tt0121766&r=json"
# Import two URLs with fromJSON(): sw4 and sw3
#sw4 <- fromJSON(url_sw4)
#sw3 <- fromJSON(url_sw3)
# Print the Title element of both lists
#sw4$Title
#sw3$Title
# Convert mtcars to a pretty JSON: pretty_json
pretty_json <- toJSON(mtcars, pretty = TRUE)
pretty_json
## [
## {
## "mpg": 21,
## "cyl": 6,
## "disp": 160,
## "hp": 110,
## "drat": 3.9,
## "wt": 2.62,
## "qsec": 16.46,
## "vs": 0,
## "am": 1,
## "gear": 4,
## "carb": 4,
## "_row": "Mazda RX4"
## },
## {
## "mpg": 21,
## "cyl": 6,
## "disp": 160,
## "hp": 110,
## "drat": 3.9,
## "wt": 2.875,
## "qsec": 17.02,
## "vs": 0,
## "am": 1,
## "gear": 4,
## "carb": 4,
## "_row": "Mazda RX4 Wag"
## },
## {
## "mpg": 22.8,
## "cyl": 4,
## "disp": 108,
## "hp": 93,
## "drat": 3.85,
## "wt": 2.32,
## "qsec": 18.61,
## "vs": 1,
## "am": 1,
## "gear": 4,
## "carb": 1,
## "_row": "Datsun 710"
## },
## {
## "mpg": 21.4,
## "cyl": 6,
## "disp": 258,
## "hp": 110,
## "drat": 3.08,
## "wt": 3.215,
## "qsec": 19.44,
## "vs": 1,
## "am": 0,
## "gear": 3,
## "carb": 1,
## "_row": "Hornet 4 Drive"
## },
## {
## "mpg": 18.7,
## "cyl": 8,
## "disp": 360,
## "hp": 175,
## "drat": 3.15,
## "wt": 3.44,
## "qsec": 17.02,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 2,
## "_row": "Hornet Sportabout"
## },
## {
## "mpg": 18.1,
## "cyl": 6,
## "disp": 225,
## "hp": 105,
## "drat": 2.76,
## "wt": 3.46,
## "qsec": 20.22,
## "vs": 1,
## "am": 0,
## "gear": 3,
## "carb": 1,
## "_row": "Valiant"
## },
## {
## "mpg": 14.3,
## "cyl": 8,
## "disp": 360,
## "hp": 245,
## "drat": 3.21,
## "wt": 3.57,
## "qsec": 15.84,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 4,
## "_row": "Duster 360"
## },
## {
## "mpg": 24.4,
## "cyl": 4,
## "disp": 146.7,
## "hp": 62,
## "drat": 3.69,
## "wt": 3.19,
## "qsec": 20,
## "vs": 1,
## "am": 0,
## "gear": 4,
## "carb": 2,
## "_row": "Merc 240D"
## },
## {
## "mpg": 22.8,
## "cyl": 4,
## "disp": 140.8,
## "hp": 95,
## "drat": 3.92,
## "wt": 3.15,
## "qsec": 22.9,
## "vs": 1,
## "am": 0,
## "gear": 4,
## "carb": 2,
## "_row": "Merc 230"
## },
## {
## "mpg": 19.2,
## "cyl": 6,
## "disp": 167.6,
## "hp": 123,
## "drat": 3.92,
## "wt": 3.44,
## "qsec": 18.3,
## "vs": 1,
## "am": 0,
## "gear": 4,
## "carb": 4,
## "_row": "Merc 280"
## },
## {
## "mpg": 17.8,
## "cyl": 6,
## "disp": 167.6,
## "hp": 123,
## "drat": 3.92,
## "wt": 3.44,
## "qsec": 18.9,
## "vs": 1,
## "am": 0,
## "gear": 4,
## "carb": 4,
## "_row": "Merc 280C"
## },
## {
## "mpg": 16.4,
## "cyl": 8,
## "disp": 275.8,
## "hp": 180,
## "drat": 3.07,
## "wt": 4.07,
## "qsec": 17.4,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 3,
## "_row": "Merc 450SE"
## },
## {
## "mpg": 17.3,
## "cyl": 8,
## "disp": 275.8,
## "hp": 180,
## "drat": 3.07,
## "wt": 3.73,
## "qsec": 17.6,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 3,
## "_row": "Merc 450SL"
## },
## {
## "mpg": 15.2,
## "cyl": 8,
## "disp": 275.8,
## "hp": 180,
## "drat": 3.07,
## "wt": 3.78,
## "qsec": 18,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 3,
## "_row": "Merc 450SLC"
## },
## {
## "mpg": 10.4,
## "cyl": 8,
## "disp": 472,
## "hp": 205,
## "drat": 2.93,
## "wt": 5.25,
## "qsec": 17.98,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 4,
## "_row": "Cadillac Fleetwood"
## },
## {
## "mpg": 10.4,
## "cyl": 8,
## "disp": 460,
## "hp": 215,
## "drat": 3,
## "wt": 5.424,
## "qsec": 17.82,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 4,
## "_row": "Lincoln Continental"
## },
## {
## "mpg": 14.7,
## "cyl": 8,
## "disp": 440,
## "hp": 230,
## "drat": 3.23,
## "wt": 5.345,
## "qsec": 17.42,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 4,
## "_row": "Chrysler Imperial"
## },
## {
## "mpg": 32.4,
## "cyl": 4,
## "disp": 78.7,
## "hp": 66,
## "drat": 4.08,
## "wt": 2.2,
## "qsec": 19.47,
## "vs": 1,
## "am": 1,
## "gear": 4,
## "carb": 1,
## "_row": "Fiat 128"
## },
## {
## "mpg": 30.4,
## "cyl": 4,
## "disp": 75.7,
## "hp": 52,
## "drat": 4.93,
## "wt": 1.615,
## "qsec": 18.52,
## "vs": 1,
## "am": 1,
## "gear": 4,
## "carb": 2,
## "_row": "Honda Civic"
## },
## {
## "mpg": 33.9,
## "cyl": 4,
## "disp": 71.1,
## "hp": 65,
## "drat": 4.22,
## "wt": 1.835,
## "qsec": 19.9,
## "vs": 1,
## "am": 1,
## "gear": 4,
## "carb": 1,
## "_row": "Toyota Corolla"
## },
## {
## "mpg": 21.5,
## "cyl": 4,
## "disp": 120.1,
## "hp": 97,
## "drat": 3.7,
## "wt": 2.465,
## "qsec": 20.01,
## "vs": 1,
## "am": 0,
## "gear": 3,
## "carb": 1,
## "_row": "Toyota Corona"
## },
## {
## "mpg": 15.5,
## "cyl": 8,
## "disp": 318,
## "hp": 150,
## "drat": 2.76,
## "wt": 3.52,
## "qsec": 16.87,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 2,
## "_row": "Dodge Challenger"
## },
## {
## "mpg": 15.2,
## "cyl": 8,
## "disp": 304,
## "hp": 150,
## "drat": 3.15,
## "wt": 3.435,
## "qsec": 17.3,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 2,
## "_row": "AMC Javelin"
## },
## {
## "mpg": 13.3,
## "cyl": 8,
## "disp": 350,
## "hp": 245,
## "drat": 3.73,
## "wt": 3.84,
## "qsec": 15.41,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 4,
## "_row": "Camaro Z28"
## },
## {
## "mpg": 19.2,
## "cyl": 8,
## "disp": 400,
## "hp": 175,
## "drat": 3.08,
## "wt": 3.845,
## "qsec": 17.05,
## "vs": 0,
## "am": 0,
## "gear": 3,
## "carb": 2,
## "_row": "Pontiac Firebird"
## },
## {
## "mpg": 27.3,
## "cyl": 4,
## "disp": 79,
## "hp": 66,
## "drat": 4.08,
## "wt": 1.935,
## "qsec": 18.9,
## "vs": 1,
## "am": 1,
## "gear": 4,
## "carb": 1,
## "_row": "Fiat X1-9"
## },
## {
## "mpg": 26,
## "cyl": 4,
## "disp": 120.3,
## "hp": 91,
## "drat": 4.43,
## "wt": 2.14,
## "qsec": 16.7,
## "vs": 0,
## "am": 1,
## "gear": 5,
## "carb": 2,
## "_row": "Porsche 914-2"
## },
## {
## "mpg": 30.4,
## "cyl": 4,
## "disp": 95.1,
## "hp": 113,
## "drat": 3.77,
## "wt": 1.513,
## "qsec": 16.9,
## "vs": 1,
## "am": 1,
## "gear": 5,
## "carb": 2,
## "_row": "Lotus Europa"
## },
## {
## "mpg": 15.8,
## "cyl": 8,
## "disp": 351,
## "hp": 264,
## "drat": 4.22,
## "wt": 3.17,
## "qsec": 14.5,
## "vs": 0,
## "am": 1,
## "gear": 5,
## "carb": 4,
## "_row": "Ford Pantera L"
## },
## {
## "mpg": 19.7,
## "cyl": 6,
## "disp": 145,
## "hp": 175,
## "drat": 3.62,
## "wt": 2.77,
## "qsec": 15.5,
## "vs": 0,
## "am": 1,
## "gear": 5,
## "carb": 6,
## "_row": "Ferrari Dino"
## },
## {
## "mpg": 15,
## "cyl": 8,
## "disp": 301,
## "hp": 335,
## "drat": 3.54,
## "wt": 3.57,
## "qsec": 14.6,
## "vs": 0,
## "am": 1,
## "gear": 5,
## "carb": 8,
## "_row": "Maserati Bora"
## },
## {
## "mpg": 21.4,
## "cyl": 4,
## "disp": 121,
## "hp": 109,
## "drat": 4.11,
## "wt": 2.78,
## "qsec": 18.6,
## "vs": 1,
## "am": 1,
## "gear": 4,
## "carb": 2,
## "_row": "Volvo 142E"
## }
## ]
# Minify pretty_json: mini_json
mini_json <- minify(pretty_json)
mini_json
## [{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"},{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.875,"qsec":17.02,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4 Wag"},{"mpg":22.8,"cyl":4,"disp":108,"hp":93,"drat":3.85,"wt":2.32,"qsec":18.61,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Datsun 710"},{"mpg":21.4,"cyl":6,"disp":258,"hp":110,"drat":3.08,"wt":3.215,"qsec":19.44,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Hornet 4 Drive"},{"mpg":18.7,"cyl":8,"disp":360,"hp":175,"drat":3.15,"wt":3.44,"qsec":17.02,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Hornet Sportabout"},{"mpg":18.1,"cyl":6,"disp":225,"hp":105,"drat":2.76,"wt":3.46,"qsec":20.22,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Valiant"},{"mpg":14.3,"cyl":8,"disp":360,"hp":245,"drat":3.21,"wt":3.57,"qsec":15.84,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Duster 360"},{"mpg":24.4,"cyl":4,"disp":146.7,"hp":62,"drat":3.69,"wt":3.19,"qsec":20,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 240D"},{"mpg":22.8,"cyl":4,"disp":140.8,"hp":95,"drat":3.92,"wt":3.15,"qsec":22.9,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 230"},{"mpg":19.2,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.3,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280"},{"mpg":17.8,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.9,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280C"},{"mpg":16.4,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":4.07,"qsec":17.4,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SE"},{"mpg":17.3,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.73,"qsec":17.6,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SL"},{"mpg":15.2,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.78,"qsec":18,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SLC"},{"mpg":10.4,"cyl":8,"disp":472,"hp":205,"drat":2.93,"wt":5.25,"qsec":17.98,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Cadillac Fleetwood"},{"mpg":10.4,"cyl":8,"disp":460,"hp":215,"drat":3,"wt":5.424,"qsec":17.82,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Lincoln Continental"},{"mpg":14.7,"cyl":8,"disp":440,"hp":230,"drat":3.23,"wt":5.345,"qsec":17.42,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Chrysler Imperial"},{"mpg":32.4,"cyl":4,"disp":78.7,"hp":66,"drat":4.08,"wt":2.2,"qsec":19.47,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat 128"},{"mpg":30.4,"cyl":4,"disp":75.7,"hp":52,"drat":4.93,"wt":1.615,"qsec":18.52,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Honda Civic"},{"mpg":33.9,"cyl":4,"disp":71.1,"hp":65,"drat":4.22,"wt":1.835,"qsec":19.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Toyota Corolla"},{"mpg":21.5,"cyl":4,"disp":120.1,"hp":97,"drat":3.7,"wt":2.465,"qsec":20.01,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Toyota Corona"},{"mpg":15.5,"cyl":8,"disp":318,"hp":150,"drat":2.76,"wt":3.52,"qsec":16.87,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Dodge Challenger"},{"mpg":15.2,"cyl":8,"disp":304,"hp":150,"drat":3.15,"wt":3.435,"qsec":17.3,"vs":0,"am":0,"gear":3,"carb":2,"_row":"AMC Javelin"},{"mpg":13.3,"cyl":8,"disp":350,"hp":245,"drat":3.73,"wt":3.84,"qsec":15.41,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Camaro Z28"},{"mpg":19.2,"cyl":8,"disp":400,"hp":175,"drat":3.08,"wt":3.845,"qsec":17.05,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Pontiac Firebird"},{"mpg":27.3,"cyl":4,"disp":79,"hp":66,"drat":4.08,"wt":1.935,"qsec":18.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat X1-9"},{"mpg":26,"cyl":4,"disp":120.3,"hp":91,"drat":4.43,"wt":2.14,"qsec":16.7,"vs":0,"am":1,"gear":5,"carb":2,"_row":"Porsche 
914-2"},{"mpg":30.4,"cyl":4,"disp":95.1,"hp":113,"drat":3.77,"wt":1.513,"qsec":16.9,"vs":1,"am":1,"gear":5,"carb":2,"_row":"Lotus Europa"},{"mpg":15.8,"cyl":8,"disp":351,"hp":264,"drat":4.22,"wt":3.17,"qsec":14.5,"vs":0,"am":1,"gear":5,"carb":4,"_row":"Ford Pantera L"},{"mpg":19.7,"cyl":6,"disp":145,"hp":175,"drat":3.62,"wt":2.77,"qsec":15.5,"vs":0,"am":1,"gear":5,"carb":6,"_row":"Ferrari Dino"},{"mpg":15,"cyl":8,"disp":301,"hp":335,"drat":3.54,"wt":3.57,"qsec":14.6,"vs":0,"am":1,"gear":5,"carb":8,"_row":"Maserati Bora"},{"mpg":21.4,"cyl":4,"disp":121,"hp":109,"drat":4.11,"wt":2.78,"qsec":18.6,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Volvo 142E"}]
haven and foreign
R supports SAS, STATA, and SPSS files through the haven and foreign packages.
library(haven)
## Warning: package 'haven' was built under R version 3.4.4
sales <- read_sas("http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/sales.sas7bdat")
sugar <- read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")
# Convert labeled values in Date column to dates
sugar$Date <- as.Date(as_factor(sugar$Date))
dat <- read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")
library(foreign)
# foreign can load SAS xport files but not sas7bdat files.
# load in the data and store it in the variable cars
cars <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars.csv")
# print the first 6 rows of the dataset using the head() function
head(cars)
## mpg cyl disp hp drat wt qsec vs am gear carb car
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Valiant
Change the variable separator for text files with the sep argument. Use sep = "\t" for tab.
# load in the dataset
cars <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars_semicolon.csv", sep = ";")
# print the first 6 rows of the dataset
head(cars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Get and set your working directory with getwd() and setwd().
getwd()
## [1] "C:/Users/mpfol/OneDrive/Documents/Data Analysis"
list.files()
## [1] "Analyzing Survey Data in R.Rmd"
## [2] "Analyzing_Survey_Data_in_R.html"
## [3] "Cookbook for R.Rmd"
## [4] "Cookbook_for_R.html"
## [5] "Cookbook_for_R.Rmd"
## [6] "Cookbook_for_R_files"
## [7] "Coursework"
## [8] "Data"
## [9] "Data Analysis.docx"
## [10] "Data Analysis.xlsx"
## [11] "Data Visualization.docx"
## [12] "Foundations of Inference.Rmd"
## [13] "Foundations_of_Inference.html"
## [14] "local_latitude.xls"
## [15] "Programs"
## [16] "rmarkdown-cheatsheet.pdf"
## [17] "rsconnect"
## [18] "Statistical Analysis.docx"
## [19] "Statistical Package Syntax (1).docx"
## [20] "Statistics Notes.docx"
## [21] "Statistics v20170301.docx"
Data exploration starts with evaluation of structure and characteristics using class() (it had better be a data.frame), dim(), and names(). Create summaries with str() or glimpse(), and summary(). Run some initial visualizations for insights into distributions. Use histograms for univariate analysis, scatterplots for numeric-numeric bivariate analysis, and boxplots for numeric-factor bivariate analysis.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:gdata':
##
## combine, first, last
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Check structure
class(mtcars)
## [1] "data.frame"
dim(mtcars)
## [1] 32 11
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
# Initial summaries
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
glimpse(mtcars) # Slightly cleaner version of str (requires dplyr).
## Observations: 32
## Variables: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
hist(mtcars$mpg)
plot(mtcars$mpg, mtcars$qsec)
# View sample data
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
Tidy data organizes a single observational unit into rows and columns. Use the tidyr package to tidy messy data.
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
wide_df <- data.frame(Obs=c(1,2),
a=c(1,4),
b=c(2,5),
c=c(3,6),
year_mo=c("2010-05","2007-07"))
wide_df
## Obs a b c year_mo
## 1 1 1 2 3 2010-05
## 2 2 4 5 6 2007-07
# Gather wide data into key-value pairs. Exclude Obs and year_mo
long_df <- gather(wide_df, my_key, my_val, -c(Obs,year_mo))
long_df
## Obs year_mo my_key my_val
## 1 1 2010-05 a 1
## 2 2 2007-07 a 4
## 3 1 2010-05 b 2
## 4 2 2007-07 b 5
## 5 1 2010-05 c 3
## 6 2 2007-07 c 6
# The opposite of gather() is spread()
wide_df <- spread(long_df, my_key, my_val)
wide_df
## Obs year_mo a b c
## 1 1 2010-05 1 2 3
## 2 2 2007-07 4 5 6
# Split a column using separate().
long_df_sep <- separate(long_df, col = year_mo, into = c("year","month"), sep = "-")
long_df_sep
## Obs year month my_key my_val
## 1 1 2010 05 a 1
## 2 2 2007 07 a 4
## 3 1 2010 05 b 2
## 4 2 2007 07 b 5
## 5 1 2010 05 c 3
## 6 2 2007 07 c 6
# The opposite of separate() is unite()
long_df_uni <- unite(long_df_sep, year_mo, year, month, sep = "-")
long_df_uni
## Obs year_mo my_key my_val
## 1 1 2010-05 a 1
## 2 2 2007-07 a 4
## 3 1 2010-05 b 2
## 4 2 2007-07 b 5
## 5 1 2010-05 c 3
## 6 2 2007-07 c 6
Types of variables in R:
* character
* numeric, including NaN and Inf
* integer, denoted 123L
* factor
* logical, including NA
Coerce variables into data types with:
* as.character()
* as.numeric()
* as.integer()
* as.factor()
* as.logical(), where 0 := FALSE
Package lubridate coerces strings to dates; valid masking characters are y, m, d, h, m, and s. Unite several fields into one with unite(). Rearrange column order with select(). Change the structure of multiple columns with mutate_at().
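A few coercion examples:
as.numeric("3.14")
## [1] 3.14
as.integer(3.9)    # truncates toward zero
## [1] 3
as.logical(0)
## [1] FALSE
as.character(TRUE)
## [1] "TRUE"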
Because the period (.) has special meaning in certain situations, use underscores (_) to separate words in variable names. Use all lowercase letters so that no one has to remember which letters are uppercase or lowercase.
Package lubridate
manipulates dates. Round dates with round_date
, floor_date
, and ceiling_date
. All three take a unit argument specifying the resolution of rounding: “second”, “minute”, “hour”, “day”, “week”, “month”, “bimonth”, “quarter”, “halfyear”, or “year”. Or, you can specify any multiple of those units, e.g. “5 years”, “3 minutes” etc.
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.4.4
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following object is masked from 'package:base':
##
## date
# There are 3! = 6 ymd-style date functions: ymd(), ydm(), mdy(), myd(), dmy(), dym().
# Create datetimes with: _h, _hm, or _hms
as.Date(ymd_hms("2005/10/23 14:40:00"))
## [1] "2005-10-23"
as.POSIXct(mdy("July 21, 2006"))
## [1] "2006-07-20 20:00:00 EDT"
ymd("2006-07-21")
## [1] "2006-07-21"
ymd("2006 Jul 21")
## [1] "2006-07-21"
mdy("July 21, 2006")
## [1] "2006-07-21"
hms("10:25:09")
## [1] "10H 25M 9S"
ymd_hms("2005/10/23 14:40:00")
## [1] "2005-10-23 14:40:00 UTC"
# If date is in an unsupported order like dym_msh, use parse_date_time() with argument orders specifying the order of the components in the date.
# Combine date parts with make_date(year, month, day).
r_3_4_1 <- ymd_hms("2016-05-03 07:13:28 UTC")
# Date rounding
floor_date(r_3_4_1, unit = "day")
## [1] "2016-05-03 UTC"
round_date(r_3_4_1, unit = "5 minutes")
## [1] "2016-05-03 07:15:00 UTC"
ceiling_date(r_3_4_1, unit = "week")
## [1] "2016-05-08 UTC"
Subtract dates with the simple - operator for a days unit, or get finer control with base function difftime(t1, t2, units). Available system datetimes are now() and today().
date_landing <- mdy("July 20, 1969")
moment_step <- mdy_hms("July 20, 1969, 02:56:15", tz = "UTC")
difftime(today(), date_landing, units = "days")
## Time difference of 18075 days
difftime(now(), moment_step, units = "secs")
## Time difference of 1561709101 secs
Use timespans to add a fixed amount of time to dates. Distinguish periods (human units) from durations (exact number of seconds) to handle daylight saving time gracefully. By combining addition and multiplication with sequences you can generate sequences of datetimes.
library(lubridate)
# Add a period of one week to mon_2pm
mon_2pm <- dmy_hm("27 Aug 2018 14:00")
mon_2pm + weeks(1)
## [1] "2018-09-03 14:00:00 UTC"
# Add a duration of 81 hours to tue_9am
tue_9am <- dmy_hm("28 Aug 2018 9:00")
tue_9am + dhours(81)
## [1] "2018-08-31 18:00:00 UTC"
# A period of five years is longer than a duration of 5 years!
today() - years(5)
## [1] "2014-01-14"
today() - dyears(5)
## [1] "2014-01-15"
# Create combined periods and durations.
eclipse_2017 <- ymd_hms("2017-08-21 18:26:40")
synodic <- ddays(29) + dhours(12) + dminutes(44) + dseconds(3)
# Create datetime for every two weeks for a year
today_8am <- today() + hours(8)
every_two_weeks <- 1:26 * weeks(2)
today_8am + every_two_weeks
## [1] "2019-01-28 08:00:00 UTC" "2019-02-11 08:00:00 UTC"
## [3] "2019-02-25 08:00:00 UTC" "2019-03-11 08:00:00 UTC"
## [5] "2019-03-25 08:00:00 UTC" "2019-04-08 08:00:00 UTC"
## [7] "2019-04-22 08:00:00 UTC" "2019-05-06 08:00:00 UTC"
## [9] "2019-05-20 08:00:00 UTC" "2019-06-03 08:00:00 UTC"
## [11] "2019-06-17 08:00:00 UTC" "2019-07-01 08:00:00 UTC"
## [13] "2019-07-15 08:00:00 UTC" "2019-07-29 08:00:00 UTC"
## [15] "2019-08-12 08:00:00 UTC" "2019-08-26 08:00:00 UTC"
## [17] "2019-09-09 08:00:00 UTC" "2019-09-23 08:00:00 UTC"
## [19] "2019-10-07 08:00:00 UTC" "2019-10-21 08:00:00 UTC"
## [21] "2019-11-04 08:00:00 UTC" "2019-11-18 08:00:00 UTC"
## [23] "2019-12-02 08:00:00 UTC" "2019-12-16 08:00:00 UTC"
## [25] "2019-12-30 08:00:00 UTC" "2020-01-13 08:00:00 UTC"
ymd("2018-01-31") + months(1)
returns NA. For situations like this, use alternative operators like %m+%
.
library(lubridate)
# A sequence of 1 to 12 periods of 1 month
month_seq <- 1:12 * months(1)
# Add 1 to 12 months to 2018-01-31. This way returns NAs.
ymd("2018-01-31") + month_seq
## [1] NA "2018-03-31" NA "2018-05-31" NA
## [6] "2018-07-31" "2018-08-31" NA "2018-10-31" NA
## [11] "2018-12-31" "2019-01-31"
# Better way.
ymd("2018-01-31") %m+% month_seq
## [1] "2018-02-28" "2018-03-31" "2018-04-30" "2018-05-31" "2018-06-30"
## [6] "2018-07-31" "2018-08-31" "2018-09-30" "2018-10-31" "2018-11-30"
## [11] "2018-12-31" "2019-01-31"
Intervals have a specific start and end time. There are two notations: datetime1 %--% datetime2
, or interval(datetime1, datetime2)
.
# Two ways to create an interval.
dmy("5 January 1961") %--% dmy("30 January 1969")
## [1] 1961-01-05 UTC--1969-01-30 UTC
interval(dmy("5 January 1961"), dmy("30 January 1969"))
## [1] 1961-01-05 UTC--1969-01-30 UTC
Once you have an interval you can find its start, end, and length with int_start(), int_end(), and int_length() respectively. You can test whether a date falls %within% an interval. You can test whether two intervals overlap with int_overlaps().
my_intvl <- interval(dmy("5 January 1961"), dmy("30 January 1969"))
int_length(my_intvl)
## [1] 254620800
y2001 <- ymd("2001-01-01") %--% ymd("2001-12-31")
ymd("2001-03-30") %within% y2001
## [1] TRUE
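A minimal sketch of int_start(), int_end(), and int_overlaps(), reusing the intervals above:
y2001 <- ymd("2001-01-01") %--% ymd("2001-12-31")
h2_2001 <- ymd("2001-07-01") %--% ymd("2002-06-30")
int_start(y2001)              # "2001-01-01 UTC"
int_end(y2001)                # "2001-12-31 UTC"
int_overlaps(y2001, h2_2001)  # TRUE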
Convert an interval to a period or duration with as.period
and as.duration
.
my_intvl <- interval(dmy("5 January 1961"), dmy("30 January 1969"))
as.period(my_intvl)
## [1] "8y 0m 25d 0H 0M 0S"
as.duration(my_intvl)
## [1] "254620800s (~8.07 years)"
Extract the time zone with tz(). Change the time zone with force_tz(dt, tzone=) or temporarily view it in another zone with with_tz(dt, tzone=). Get valid tzone names from OlsonNames().
game2 <- mdy_hm("June 11 2015 19:00")
game3 <- mdy_hm("June 15 2015 18:30")
# Set the timezone to "America/Edmonton"
game2_local <- force_tz(game2, tzone = "America/Edmonton")
game3_local <- force_tz(game3, tzone = "America/Winnipeg")
# What time is game2_local in NZ?
with_tz(game2_local, tzone = "Pacific/Auckland")
## [1] "2015-06-12 13:00:00 NZST"
stamp() is a great way to format a date. It returns a formatting function whose format string you specify by example.
stamp("09/20/2017")(today())
## Multiple formats matched: "%Om/%d/%y%H"(1), "%Om/%y/%d%H"(1), "%Om/%d/%Y"(1), "%m/%d/%y%H"(1), "%m/%y/%d%H"(1), "%m/%d/%Y"(1)
## Using: "%Om/%y/%d%H"
## [1] "01/19/1400"
Package stringr
manipulates strings.
library(stringr)
# trim whitespace.
str_trim(" this is a test ")
## [1] "this is a test"
# pad string with zeros.
str_pad("2493", width = 7, side = "left", pad = "0")
## [1] "0002493"
# find pattern Alice
str_detect(c("Sarah", "Alice", "Tom"), "Alice")
## [1] FALSE TRUE FALSE
# replace pattern Alice with Jeff
str_replace(c("Sarah", "Alice", "Tom"), "Alice", "Jeff")
## [1] "Sarah" "Jeff" "Tom"
# Change case (toupper/tolower are base R; stringr offers str_to_upper()/str_to_lower())
toupper("DataCamp")
## [1] "DATACAMP"
tolower("DataCamp")
## [1] "datacamp"
Use is.na() to locate missing (NA) values.
# 4x3 data frame with a few NAs.
df <- data.frame(A = c(1, NA, 8, NA),
B = c(3, NA, 88, 23),
C = c(2, 45, 3, 1),
D = c("A", "", "C", "D"))
# Any NAs?
any(is.na(df))
## [1] TRUE
# locate the NAs.
is.na(df)
## A B C D
## [1,] FALSE FALSE FALSE FALSE
## [2,] TRUE TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,] TRUE FALSE FALSE FALSE
# How many?
sum(is.na(df))
## [1] 3
# Summarize the NAs
summary(df)
## A B C D
## Min. :1.00 Min. : 3.0 Min. : 1.00 :1
## 1st Qu.:2.75 1st Qu.:13.0 1st Qu.: 1.75 A:1
## Median :4.50 Median :23.0 Median : 2.50 C:1
## Mean :4.50 Mean :38.0 Mean :12.75 D:1
## 3rd Qu.:6.25 3rd Qu.:55.5 3rd Qu.:13.50
## Max. :8.00 Max. :88.0 Max. :45.00
## NA's :2 NA's :1
# Rows with no missing values, two ways
df[complete.cases(df),]
## A B C D
## 1 1 3 2 A
## 3 8 88 3 C
na.omit(df)
## A B C D
## 1 1 3 2 A
## 3 8 88 3 C
# Replace empty strings with NA
df$D[df$D == ""] <- NA
df2 <- data.frame(A = rnorm(100,50,10),
B = c(rnorm(99,50,10), 500),
C = c(rnorm(99,50,10), -1))
# Find outliers using hist() or boxplot().
hist(df2$B)
boxplot(df2)
# Drop or replace outliers. Use which() to find index of offending observation.
mymtcars <- mtcars
ind <- which(mymtcars$mpg == 15.0)
mymtcars$mpg[ind] <- 20.0
dplyr
The dplyr package provides data wrangling tools. dplyr introduces the tibble, a data frame constrained to display well in an R session. The tibble class inherits from the data frame class. Convert a data frame to a tibble with as_tibble(data.frame) (the older tbl_df() is deprecated). glimpse(tbl) works with tibbles the way str(data.frame) works with data frames. Convert a tibble back to a data frame with as.data.frame(tbl).
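A minimal round-trip sketch on mtcars:
library(dplyr)
tbl <- as_tibble(mtcars)
glimpse(tbl)              # column-wise preview, like str()
df <- as.data.frame(tbl)  # back to a plain data frame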
library(dplyr)
# hflights is a data.frame of Houston based flights.
library(hflights)
## Warning: package 'hflights' was built under R version 3.4.4
hflights <- as_tibble(hflights)
head(hflights)
## # A tibble: 6 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## # ... with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## # AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## # Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
summary(hflights)
## Year Month DayofMonth DayOfWeek
## Min. :2011 Min. : 1.000 Min. : 1.00 Min. :1.000
## 1st Qu.:2011 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2.000
## Median :2011 Median : 7.000 Median :16.00 Median :4.000
## Mean :2011 Mean : 6.514 Mean :15.74 Mean :3.948
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:6.000
## Max. :2011 Max. :12.000 Max. :31.00 Max. :7.000
##
## DepTime ArrTime UniqueCarrier FlightNum
## Min. : 1 Min. : 1 Length:227496 Min. : 1
## 1st Qu.:1021 1st Qu.:1215 Class :character 1st Qu.: 855
## Median :1416 Median :1617 Mode :character Median :1696
## Mean :1396 Mean :1578 Mean :1962
## 3rd Qu.:1801 3rd Qu.:1953 3rd Qu.:2755
## Max. :2400 Max. :2400 Max. :7290
## NA's :2905 NA's :3066
## TailNum ActualElapsedTime AirTime ArrDelay
## Length:227496 Min. : 34.0 Min. : 11.0 Min. :-70.000
## Class :character 1st Qu.: 77.0 1st Qu.: 58.0 1st Qu.: -8.000
## Mode :character Median :128.0 Median :107.0 Median : 0.000
## Mean :129.3 Mean :108.1 Mean : 7.094
## 3rd Qu.:165.0 3rd Qu.:141.0 3rd Qu.: 11.000
## Max. :575.0 Max. :549.0 Max. :978.000
## NA's :3622 NA's :3622 NA's :3622
## DepDelay Origin Dest Distance
## Min. :-33.000 Length:227496 Length:227496 Min. : 79.0
## 1st Qu.: -3.000 Class :character Class :character 1st Qu.: 376.0
## Median : 0.000 Mode :character Mode :character Median : 809.0
## Mean : 9.445 Mean : 787.8
## 3rd Qu.: 9.000 3rd Qu.:1042.0
## Max. :981.000 Max. :3904.0
## NA's :2905
## TaxiIn TaxiOut Cancelled CancellationCode
## Min. : 1.000 Min. : 1.00 Min. :0.00000 Length:227496
## 1st Qu.: 4.000 1st Qu.: 10.00 1st Qu.:0.00000 Class :character
## Median : 5.000 Median : 14.00 Median :0.00000 Mode :character
## Mean : 6.099 Mean : 15.09 Mean :0.01307
## 3rd Qu.: 7.000 3rd Qu.: 18.00 3rd Qu.:0.00000
## Max. :165.000 Max. :163.00 Max. :1.00000
## NA's :3066 NA's :2947
## Diverted
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.000000
## Mean :0.002853
## 3rd Qu.:0.000000
## Max. :1.000000
##
# hflights consists of 227,496 observations and 21 variables.
nrow(hflights)
## [1] 227496
ncol(hflights)
## [1] 21
# Create a lookup table for the UniqueCarrier column using a named vector.
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
hflights$Carrier <- lut[hflights$UniqueCarrier]
dplyr features five verbs:
* select(.data, ...) where ... are variables. Use : to select a range of variables, and - to exclude some variables, similar to indexing a data.frame with square brackets. Use variable names or integer indexes. Helper functions include starts_with(), ends_with(), contains(), matches(), num_range(), and one_of().
* filter(.data, one or more comparisons) returns a filtered data set. Among the operators are ==, !=, and %in%. Combine comparisons with & and |.
* arrange(.data, ...) returns a sorted data set. Wrap the arguments with desc() to override the default ascending sort order, e.g. arrange(desc(gdpPercap)).
* mutate(.data, name-value pairs of expressions) adds or changes values in the data set.
* summarise(.data, ...). Base R includes several aggregate functions, and dplyr adds first(), last(), nth(), n(), and n_distinct().
Pipe a data set into a verb with %>%. group_by(.data, col(s)) only has an effect when combined with summarize(), so specify group_by() prior to summarize(). A minimal arrange() sketch follows; the hflights examples below cover the other verbs.
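A minimal sketch of arrange() with desc(), using mtcars:
library(dplyr)
# Three highest-mpg cars: sort descending by mpg, keep a few columns.
mtcars %>%
select(mpg, cyl, wt) %>%
arrange(desc(mpg)) %>%
head(3)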
dplyr uses %>% from the magrittr package.
library(dplyr)
library(hflights)
hflights <- as_tibble(hflights)
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
hflights$Carrier <- lut[hflights$UniqueCarrier]
# select example
select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancell"))
## # A tibble: 227,496 x 5
## UniqueCarrier FlightNum TailNum Cancelled CancellationCode
## * <chr> <int> <chr> <int> <chr>
## 1 AA 428 N576AA 0 ""
## 2 AA 428 N557AA 0 ""
## 3 AA 428 N541AA 0 ""
## 4 AA 428 N403AA 0 ""
## 5 AA 428 N492AA 0 ""
## 6 AA 428 N262AA 0 ""
## 7 AA 428 N493AA 0 ""
## 8 AA 428 N477AA 0 ""
## 9 AA 428 N476AA 0 ""
## 10 AA 428 N504AA 0 ""
## # ... with 227,486 more rows
# mutate example
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)
## Warning: `as_dictionary()` is soft-deprecated as of rlang 0.3.0.
## Please use `as_data_pronoun()` instead
## This warning is displayed once per session.
## Warning: `new_overscope()` is soft-deprecated as of rlang 0.2.0.
## Please use `new_data_mask()` instead
## This warning is displayed once per session.
## Warning: The `parent` argument of `new_data_mask()` is deprecated.
## The parent of the data mask is determined from either:
##
## * The `env` argument of `eval_tidy()`
## * Quosure environments when applicable
## This warning is displayed once per session.
## Warning: `overscope_clean()` is soft-deprecated as of rlang 0.2.0.
## This warning is displayed once per session.
# filter example
hflights %>%
mutate(RealTime = ActualElapsedTime + 100, mph = 60 * Distance/ RealTime) %>%
filter(!is.na(mph) & mph < 70) %>%
group_by(UniqueCarrier) %>%
summarize(n_less = n(), n_dest = n_distinct(Dest), min_dist = min(Distance), max_dist = max(Distance))
## # A tibble: 6 x 5
## UniqueCarrier n_less n_dest min_dist max_dist
## <chr> <int> <int> <dbl> <dbl>
## 1 AA 40 1 224. 224.
## 2 CO 3393 4 140. 305.
## 3 MQ 12 1 247. 247.
## 4 OO 349 3 140. 224.
## 5 WN 1747 4 148. 239.
## 6 XE 1185 12 79. 253.
dplyr
works for data frames, data tables, and databases.
Use dplyr to merge data instead of base R merge() because dplyr syntax is intuitive, preserves row order, and works with databases.
The four mutating joins are left_join(tbl1, tbl2, by = c(col_names))
, right_join
, inner_join
, and full_join
.
Filtering join semi_join returns rows of the first table that have a match in the second, without adding columns from it. Filtering join anti_join returns rows of the first table that have no match in the second.
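A minimal sketch of the filtering joins using dplyr's built-in demo tables band_members and band_instruments:
library(dplyr)
# Members who have a matching row in band_instruments (no columns added).
band_members %>% semi_join(band_instruments, by = "name")
# Members with no matching row in band_instruments.
band_members %>% anti_join(band_instruments, by = "name")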
Set functions are union(), intersect(), and setdiff(). setequal(set1, set2) checks for row equality (not necessarily order).
If two datasets have identical structure, combine them with bind_rows() and bind_cols(), the dplyr equivalents to base R rbind() and cbind().
dplyr improves the base R function data.frame() with data_frame(). data_frame() will not change data types, add row or column names, or recycle vectors. as_data_frame() parallels the behavior of data_frame(); it combines a list of vectors into a data frame, making it the column-wise equivalent of bind_rows(), which combines data frames row-wise.
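A minimal sketch (later dplyr/tibble versions rename data_frame() to tibble()):
library(dplyr)
df1 <- data_frame(x = 1:2, y = c("a", "b"))
df2 <- data_frame(x = 3L, y = "c")
bind_rows(df1, df2)  # three rows; y stays character, nothing recycled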
library(Lahman)
## Warning: package 'Lahman' was built under R version 3.4.4
library(dplyr)
players <- Master %>%
distinct(playerID, nameFirst, nameLast)
players %>%
# Find unsalaried players
anti_join(Salaries, by = "playerID") %>%
# Join Batting to the unsalaried players
left_join(Batting, by = "playerID") %>%
# Group by player
group_by(playerID) %>%
# Sum at-bats for each player
summarise(total_at_bat = sum(AB, na.rm = TRUE)) %>%
# Arrange in descending order
arrange(desc(total_at_bat))
## # A tibble: 13,958 x 2
## playerID total_at_bat
## <chr> <int>
## 1 aaronha01 12364
## 2 yastrca01 11988
## 3 cobbty01 11434
## 4 musiast01 10972
## 5 mayswi01 10881
## 6 robinbr01 10654
## 7 wagneho01 10430
## 8 brocklo01 10332
## 9 ansonca01 10277
## 10 aparilu01 10230
## # ... with 13,948 more rows
library(Lahman)
library(dplyr)
# Find the distinct players that appear in HallOfFame
nominated <- HallOfFame %>%
distinct(playerID)
nominated %>%
# Count the number of players in nominated
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 1260
# 1,260 players were nominated for the hall of fame.
nominated_full <- nominated %>%
# Join to Master
left_join(Master, by = "playerID") %>%
# Return playerID, nameFirst, nameLast
select(playerID, nameFirst, nameLast)
# Find distinct players in HallOfFame with inducted == "Y"
inducted <- HallOfFame %>%
filter(inducted == "Y") %>%
distinct(playerID)
inducted %>%
# Count the number of players in inducted
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 317
# 317 players have been inducted.
inducted_full <- inducted %>%
# Join to Master
left_join(Master, by = "playerID") %>%
# Return playerID, nameFirst, nameLast
select(playerID, nameFirst, nameLast)
# Tally the number of awards in AwardsPlayers by playerID
nAwards <- AwardsPlayers %>%
group_by(playerID) %>%
tally()
nAwards %>%
# Filter to just the players in inducted
semi_join(inducted, by = "playerID") %>%
# Calculate the mean number of awards per player
summarize(avg_n = mean(n, na.rm = TRUE))
## # A tibble: 1 x 1
## avg_n
## <dbl>
## 1 12.1
nAwards %>%
# Filter to just the players in nominated
semi_join(nominated, by = "playerID") %>%
# Filter to players NOT in inducted
anti_join(inducted, by = "playerID") %>%
# Calculate the mean number of awards per player
summarize(avg_n = mean(n, na.rm = TRUE))
## # A tibble: 1 x 1
## avg_n
## <dbl>
## 1 4.23
# On average, inductees had roughly 12.1 - 4.23 = 7.9 more awards than non-inductees.
# Find the players who are in nominated, but not inducted
notInducted <- nominated %>%
setdiff(inducted)
Salaries %>%
# Find the players who are in notInducted
semi_join(notInducted, by = "playerID") %>%
# Calculate the max salary by player
group_by(playerID) %>%
summarize(max_salary = max(salary, na.rm = TRUE)) %>%
# Calculate the average of the max salaries
summarize(avg_salary = mean(max_salary, na.rm = TRUE))
## # A tibble: 1 x 1
## avg_salary
## <dbl>
## 1 5230273.
# Repeat for players who were inducted
Salaries %>%
semi_join(inducted, by = "playerID") %>%
group_by(playerID) %>%
summarize(max_salary = max(salary, na.rm = TRUE)) %>%
summarize(avg_salary = mean(max_salary, na.rm = TRUE))
## # A tibble: 1 x 1
## avg_salary
## <dbl>
## 1 6092038.
Appearances %>%
# Filter Appearances against nominated
semi_join(nominated, by = "playerID") %>%
# Find last year played by player
group_by(playerID) %>%
summarize(last_year = max(yearID)) %>%
# Join to full HallOfFame
left_join(HallOfFame, by = "playerID") %>%
# Filter for unusual observations: nominated less than 5 years after their last season
filter((yearID - last_year)<5)
## # A tibble: 194 x 10
## playerID last_year yearID votedBy ballots needed votes inducted
## <chr> <dbl> <int> <chr> <int> <int> <int> <fct>
## 1 altroni01 1933. 1937 BBWAA 201 151 3 N
## 2 applilu01 1950. 1953 BBWAA 264 198 2 N
## 3 bartedi01 1946. 1948 BBWAA 121 91 1 N
## 4 beckro01 2004. 2008 BBWAA 543 408 2 N
## 5 boudrlo01 1952. 1956 BBWAA 193 145 2 N
## 6 camildo01 1945. 1948 BBWAA 121 91 1 N
## 7 chandsp01 1947. 1950 BBWAA 168 126 2 N
## 8 chandsp01 1947. 1951 BBWAA 226 170 1 N
## 9 chapmbe01 1946. 1949 BBWAA 153 115 1 N
## 10 cissebi01 1938. 1937 BBWAA 201 151 1 N
## # ... with 184 more rows, and 2 more variables: category <fct>,
## # needed_note <chr>
Data visualization is about exploratory analysis (investigative) and explanatory analysis.
There are seven grammatical layers of plots; three are required: data, aesthetics, and geometries. The other elements are facets (subplots), statistics (e.g., fitted lines), coordinates, and themes. The grammar of graphics is implemented in the ggplot2
package.
Base R provides plotting functionality, but it comes with limitations. The plot is an image, not an object, so you cannot manipulate it further. It does not automatically produce a legend. There is a separate function for each plot type; the lack of a unified framework means you have to learn each plot type separately: points(), hist(), etc.
Scale the x axis with a scale_x_log10 layer. There are two main reasons to use logarithmic scales in charts and graphs. The first is to respond to skewness towards large values, i.e., cases in which one or a few points are much larger than the bulk of the data. The second is to show percent change or multiplicative factors: on a log axis with base 2, the value of each tick mark is double the value of the preceding one. An example of a multiplicative factor is constant acceleration. See the ggplot2 documentation for more on scales for continuous data.
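A minimal log-scale sketch:
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
scale_x_log10()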
For scatterplots, map x
, y
, color
, and shape
in the aesthetic layer. Map size
, fill
, shape
, alpha
(transparency), and position
(e.g., “jitter”) in the geom_point
layer.
mtcars$cyl <- as.factor(mtcars$cyl)
# Use base r to create plots with a series for each cyl value.
# Add a linear fit line through the points, one for each series, and one overall.
plot(mtcars$wt, mtcars$mpg, col = factor(mtcars$cyl))
abline(lm(mpg ~ wt, data = mtcars), lty = 2)
# Draw one fit line per cyl level. A for loop (rather than lapply over the
# whole cyl column) runs once per level and prints no NULL return values.
for (i in seq_along(levels(mtcars$cyl))) {
  abline(lm(mpg ~ wt, mtcars, subset = (cyl == levels(mtcars$cyl)[i])), col = i)
}
legend(x = 5, y = 33, legend = levels(mtcars$cyl),
col = 1:3, pch = 1, bty = "n")
# Again in ggplot2
# The first geom_smooth inherits the ggplot color aesthetic as its group.
# The second geom_smooth explicitly sets group to a dummy 1. The col = "All" adds it to the legend.
# When mapping onto color you can sometimes treat a continuous scale, like year, as an ordinal variable, but only if it is a regular series. The better alternative is to leave it as a continuous variable and use the group aesthetic as a factor to make sure your plot is drawn correctly.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
ggplot(mtcars, aes(x = wt, y = mpg, col = cyl, group = factor(cyl))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(method = "lm", se = FALSE, linetype = 2, aes(group = 1, col = "All"))
ggplot can visualize four attributes at once with x, y, col, and facet_grid. Such graphing requires tidy data, which in turn requires thoughtful definitions of metrics. In the iris data set, if measuring length vs width, those are separate variables (columns). If measuring length (or width) vs species, then species is a variable. If measuring length (or width) vs part of flower (petal vs sepal), then flower part is a variable. To look at all four together, length and width become values of a single Measure variable (because length and width share units).
library(ggplot2)
library(tidyr)
iris.tidy <- iris %>%
# gather(data, key, value, <cols>)
# Transpose all cols to rows except the identifier cols (Species)
# The former call name becomes a value in the key column.
gather(key, Value, -Species) %>%
# separate(data, col, into, sep)
separate(col = key, into = c("Part", "Measure"), sep = "\\.")
# If we want to plot Length vs Width, then each should be a column.
iris$Flower <- 1:nrow(iris)
iris.wide <- iris %>%
gather(key, value, -Species, -Flower) %>%
separate(key, c("Part", "Measure"), "\\.") %>%
spread(Measure, value)
ggplot(iris.wide, aes(x = Length, y = Width, color = Part)) +
geom_jitter() +
facet_grid(. ~ Species)
Typical aesthetics are x
, y
, colour
, fill
, size
, alpha
, linetype
, labels
, and shape
. Shapes 1:20 accept only the color aesthetic; shapes 21:25 accept both color and fill.
One common technique to use with solid shapes is alpha blending (i.e. adding transparency). An alternative is to use hollow shapes.
library(ggplot2)
# Basic scatter plot: wt on x-axis and mpg on y-axis; map cyl to col
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
geom_point(size = 4)
# Hollow circles - an improvement
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
geom_point(size = 4, shape = 1)
# Add transparency - very nice
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
geom_point(size = 4, alpha = 0.6)
The default geom_smooth method is LOESS. LOESS smoothing is a non-parametric form of regression that uses a weighted, sliding-window average to calculate a line of best fit. Control the size of this window with the span argument. The default span is 0.75. Reducing span creates a closer fit, but risks over-fitting.
Another useful stat function is stat_sum(). It calculates the total number of overlapping observations and is a good alternative for dealing with overplotting.
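A minimal sketch of both, reusing mtcars:
library(ggplot2)
# A narrower LOESS window gives a wigglier fit.
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(span = 0.3, se = FALSE)
# stat_sum(): point size encodes how many observations overlap.
ggplot(mtcars, aes(x = cyl, y = gear)) +
stat_sum()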
The x axis/aesthetic: the stat argument defaults to "bin", which cuts a continuous variable into discrete bins. The binwidth argument defaults to range/30, a good starting point if you do not know anything about the variable and want to start exploring. The y axis/aesthetic: geom_histogram() only requires one aesthetic, x, but there is clearly a y axis, so where does it come from? The variable ..count.. is mapped to the y aesthetic. It lives in an internal data frame of computed statistics; the .. notation fetches a variable such as count from that internal data frame. A ..density.. variable is also calculated there.
library(ggplot2)
# 1 - Make a univariate histogram
ggplot(mtcars, aes(x = mpg)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# 2 - Plot 1, plus set binwidth to 1 in the geom layer
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 1)
# 3 - Plot 2, plus MAP ..density.. to the y aesthetic (i.e. in a second aes() function)
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(aes(y = ..density..), binwidth = 1)
# 4 - plot 3, plus SET the fill attribute to "#377EB8" (outside aes(); mapping it inside aes() would create a spurious legend)
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(aes(y = ..density..), binwidth = 1, fill = "#377EB8")
#ggplot(gapminder_1952, aes(x = pop)) +
#geom_histogram() +
#scale_x_log10()
Use bar plots to compare numeric values across categorical variables. Bar plots use layer geom_col()
.
Like geom_point(), the geom_bar() and geom_histogram() geoms have a position argument which specifies how to draw the bars of the plot. Three common positions are stack (default) with counts, fill with proportions, and dodge with counts.
library(ggplot2)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)
# Draw a bar plot of cyl, filled according to am
ggplot(mtcars, aes(x = cyl, fill = am)) +
geom_bar()
# Change the position argument to stack
ggplot(mtcars, aes(x = cyl, fill = am)) +
geom_bar(position = "stack")
# Change the position argument to fill
ggplot(mtcars, aes(x = cyl, fill = am)) +
geom_bar(position = "fill")
# Change the position argument to dodge
ggplot(mtcars, aes(x = cyl, fill = am)) +
geom_bar(position = "dodge")
# Set the amount of dodging by specifying dodge as its own object.
posn_d <- position_dodge(width = 0.2)
ggplot(mtcars, aes(x = cyl, fill = am)) +
geom_bar(position = posn_d, alpha = 0.6)
by_continent <- gapminder %>%
filter(year == 1952) %>%
group_by(continent) %>%
summarize(medianGdpPercap = median(gdpPercap))
# Create a bar plot showing medianGdp by continent'
ggplot(by_continent, aes(x = continent, y = medianGdpPercap)) +
geom_col()
Set geom_point(position = )
attributes with identity
(default), dodge
(side-by-side bar), stack
, fill
(stacked bar), jitter
, and jitterdodge
. Set aesthetic scale functions with scale_<aesthetic>_<data_type>
.
library(ggplot2)
cyl.am <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))
val <- c("#E41A1C", "#377EB8")
lab <- c("Manual", "Automatic")
cyl.am +
geom_bar(position = "dodge") +
scale_x_discrete("Cylinders") +
scale_y_continuous("Number") +
scale_fill_manual("Transmission",
values = val,
labels = lab)
Line plots are almost exactly like scatter plots. Use them to show change over time.
gapminder_gt1952 <- gapminder %>%
filter(year >= 1952) %>%
group_by(year, continent) %>%
summarize(medianGdpPercap = median(gdpPercap), sumPop = sum(as.numeric(pop))) # as.numeric() avoids integer overflow
ggplot(gapminder_gt1952, aes(x = year, y = medianGdpPercap,
color = continent)) +
geom_line()
library(ggplot2)
recess <- data.frame(begin = c('1970-01-01', '1975-01-01', '1980-01-01', '1982-01-01', '1991-01-01', '2001-01-01'),
end = c('1970-12-01', '1976-12-01', '1980-12-01', '1983-12-01', '1991-12-01', '2001-12-01'))
recess$begin <- as.Date(recess$begin)
recess$end <- as.Date(recess$end)
ggplot(economics, aes(x = date, y = unemploy/pop)) +
geom_rect(data = recess,
aes(xmin = begin, xmax = end, ymin = -Inf, ymax = +Inf),
inherit.aes = FALSE, fill = "red", alpha = 0.2) +
geom_line()
qplot is a quick-and-dirty variation of ggplot.
Two options for zooming in on a plot are scale_x_continuous() and coord_cartesian(). Scale limits drop out-of-range data before statistics are computed (note the warnings below); coord_cartesian() zooms without dropping data.
library(ggplot2)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)
p <- ggplot(mtcars, aes(x = wt, y = hp, col = am)) + geom_point() + geom_smooth()
# Add scale_x_continuous()
p + scale_x_continuous(limits = c(3, 6), expand = c(0, 0))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : at 3.168
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : radius 4e-006
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : all data on boundary of neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 3.168
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.002
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : at 3.572
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : radius 4e-006
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : all data on boundary of neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4e-006
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : zero-width neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : zero-width neighborhood. make span bigger
## Warning: Computation failed in `stat_smooth()`:
## NA/NaN/Inf in foreign function call (arg 5)
## Warning: Removed 12 rows containing missing values (geom_point).
# Add coord_cartesian(): the proper way to zoom in
p + coord_cartesian(xlim = c(3, 6))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Set the aspect ratio of a plot with coord_fixed() or coord_equal(). Both use ratio = 1 as a default. A 1:1 aspect ratio is appropriate when two continuous variables are on the same scale, as with the iris dataset.
# Complete basic scatter plot function
base.plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
geom_jitter() +
geom_smooth(method = "lm", se = FALSE)
# Plot base.plot: default aspect ratio
base.plot
# Fix aspect ratio (1:1) of base.plot
base.plot + coord_equal()
Facets are another way of presenting categorical variables. The most straightforward way of using facets is facet_grid(). Here we just need to specify the categorical variable to use on rows and columns using standard R formula notation (rows ~ columns).
# Basic scatter plot
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()
# 1 - Separate rows according to transmission type, am
p +
facet_grid(am ~ .)
# 2 - Separate columns according to cylinders, cyl
p +
facet_grid(. ~ cyl)
# 3 - Separate by both columns and rows
p +
facet_grid(am ~ cyl)
The themes layer handles all the non-data ink attributes. To change the appearance of lines use the element_line() function.
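A minimal sketch of a theme tweak with element_line():
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
theme(panel.grid.major = element_line(colour = "grey80", linetype = "dashed"))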
library(ggplot2)
#library(Hmisc)
# Base layers
m <- ggplot(mtcars, aes(x = cyl, y = wt))
# Draw dynamite plot
m +
stat_summary(fun.y = mean, geom = "bar", fill = "skyblue") +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)
library(haven)
# data from https://healthpolicy.ucla.edu/chis/data/Pages/GetCHISData.aspx.
# login as mpfoley73/HealthPolicy.
chis_path <- file.path("C:/Users/mpfol/OneDrive/Documents/Data Science/Data/CHIS 2009 PUF- Adult SAS", "adult.sas7bdat")
#chis_path <- file.path("C:/Users/michael.foley/OneDrive - The Centers for Families and Children/Documents/CHIS 2009 PUF- Adult SAS", "adult.sas7bdat")
adult <- read_sas(chis_path)
dim(adult)
## [1] 47614 536
adult <- adult[c("RBMI", "BMI_P", "RACEHPR2", "SRSEX", "SRAGE_P", "MARIT2", "AB1", "ASTCUR", "AB51", "POVLL")]
dim(adult)
## [1] 47614 10
library(dplyr)
adult <- adult %>% filter(RACEHPR2 == 1 | RACEHPR2 == 4 | RACEHPR2 == 5 | RACEHPR2 == 6)
dim(adult)
## [1] 44346 10
# Investigate the relationship between BMI and age.
# Start by looking at the distributions of the univariate data.
# The default histogram has an interesting pattern of peaks. This is probably an artifact of the default binning statistic of bins = 30: the implied binwidth is about 2.23 (diff(range(adult$SRAGE_P)) / 30).
library(ggplot2)
ggplot(adult, aes(x = SRAGE_P)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# The BMI histogram shows right skew. We might want to remove extreme values.
ggplot(adult, aes(x = BMI_P)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Typically we explore the relationship between two continuous variables with a scatterplot, but this one does not reveal any interesting trends.
ggplot(adult, aes(x = SRAGE_P, y = BMI_P)) +
geom_point()
# It turns out BMI is also ordinal: 0-18.49 = Under-weight, 18.5-24.99 = Healthy-weight, 25-29.99 = Over-weight, 30.0+ = Obese. Here is a plot colored by category. But there are still problems: the range of each group differs, and it is difficult to tell the size of each group.
ggplot(adult, aes(x = SRAGE_P, y = BMI_P, col = factor(RBMI))) +
geom_point(alpha = 0.4, position = position_jitter(width = 0.5))
# Try a histogram instead. This is good, but we cannot answer meaningful questions.
# Notice one unexpected attribute: it looks like ages >=85 are categorized as 85.
ggplot(adult, aes(x = SRAGE_P, fill = factor(RBMI))) +
geom_histogram(binwidth = 1)
# How do the proportions of each BMI category change across age groups? We need to plot proportions.
ggplot(adult, aes(x = SRAGE_P, fill = factor(RBMI))) +
geom_histogram(aes(y = ..count../sum(..count..)), binwidth = 1, position = "fill")
# There is an unusual spike of individuals at 85, which seems like an artifact of data collection and storage. Solve this by only keeping observations for which adult$SRAGE_P is smaller than or equal to 84.
adult <- adult[adult$SRAGE_P <= 84, ]
# There is a long positive tail on the BMIs that we'd like to remove. Only keep observations for which adult$BMI_P is larger than or equal to 16 and adult$BMI_P is strictly smaller than 52.
adult <- adult[adult$BMI_P >= 16 & adult$BMI_P < 52, ]
# We'll focus on the relationship between the BMI score (& category), age and race. To make plotting easier later on, we'll change the labels in the dataset.
adult$RACEHPR2 <- factor(adult$RACEHPR2, labels = c("Latino", "Asian", "African American", "White"))
adult$RBMI <- ordered(adult$RBMI,
levels = c(1, 2, 3, 4),
labels = c("Under-weight", "Healthy-weight", "Over-weight", "Obese"))
# The color scale used in the plot
BMI_fill <- scale_fill_brewer("BMI Category", palette = "Reds")
# Theme to fix category display in faceted plot
fix_strips <- theme(strip.text.y = element_text(angle = 0, hjust = 0, vjust = 0.1, size = 14),
strip.background = element_blank(),
legend.position = "none")
# Histogram, add BMI_fill and customizations
ggplot(adult, aes (x = SRAGE_P, fill= RBMI)) +
geom_histogram(binwidth = 1) +
fix_strips +
BMI_fill +
facet_grid(RBMI ~ .) +
theme_classic()
# The absolute count of multiple histograms is fine, but density would be a more useful measure if we wanted to see how the frequency of one variable changes across another. Here is a frequency histogram when we have many sub-categories. The problem here is that this can't be faceted because the calculations occur on the fly inside ggplot2.
ggplot(adult, aes (x = SRAGE_P, fill= factor(RBMI))) +
geom_histogram(aes(y = ..count../sum(..count..)), binwidth = 1, position = "fill") +
BMI_fill
# To overcome this we're going to calculate the proportions outside ggplot2.
# Create DF with table()
DF <- table(adult$RBMI, adult$SRAGE_P)
# Use apply on DF to get frequency of each group
DF_freq <- apply(DF, 2, function(x) x/sum(x))
# Load reshape2 and use melt on DF to create DF_melted
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
## The following objects are masked from 'package:data.table':
##
## dcast, melt
DF_melted <- melt(DF_freq)
# Change names of DF_melted
names(DF_melted) <- c("FILL", "X", "value")
# Add code to make this a faceted plot
ggplot(DF_melted, aes(x = X, y = value, fill = FILL)) +
geom_col(position = "stack") +
BMI_fill +
facet_grid(FILL ~ .) # Facets
Mosaic plots are visualizations of chi-squared tests.
# The initial contingency table
DF <- as.data.frame.matrix(table(adult$SRAGE_P, adult$RBMI))
# Create groupSum, xmax and xmin columns
DF$groupSum <- rowSums(DF)
DF$xmax <- cumsum(DF$groupSum)
DF$xmin <- DF$xmax - DF$groupSum
# The groupSum column needs to be removed; don't remove this line
DF$groupSum <- NULL
# Copy row names to variable X
DF$X <- row.names(DF)
# Melt the dataset
library(reshape2)
DF_melted <- melt(DF, id.vars = c("X", "xmin", "xmax"), variable.name = "FILL")
# dplyr call to calculate ymin and ymax - don't change
library(dplyr)
DF_melted <- DF_melted %>%
group_by(X) %>%
mutate(ymax = cumsum(value/sum(value)),
ymin = ymax - value/sum(value))
# Plot rectangles - don't change
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.4.4
ggplot(DF_melted, aes(ymin = ymin,
ymax = ymax,
xmin = xmin,
xmax = xmax,
fill = FILL)) +
geom_rect(colour = "white") +
scale_x_continuous(expand = c(0,0)) +
scale_y_continuous(expand = c(0,0)) +
BMI_fill +
theme_tufte()
# Perform chi.sq test (RBMI and SRAGE_P)
results <- chisq.test(table(adult$RBMI, adult$SRAGE_P))
# Melt results$residuals and store as resid
resid <- melt(results$residuals)
# Change names of resid
names(resid) <- c("FILL", "X", "residual")
# merge the two datasets:
DF_all <- merge(DF_melted, resid)
# Update plot command
library(ggthemes)
ggplot(DF_all, aes(ymin = ymin,
ymax = ymax,
xmin = xmin,
xmax = xmax,
fill = residual)) +
geom_rect() +
scale_fill_gradient2() +
scale_x_continuous(expand = c(0,0)) +
scale_y_continuous(expand = c(0,0)) +
theme_tufte()
Available aesthetics include x=continuous
, y=continuous
, color=<factor>
, and size=continuous
. Use facet_wrap(~ <factor>)
to create sub-plots. Use expand_limits(y = 0)
to ensure the y-axis crosses x at zero.
gapminder_gt1952 <- gapminder %>%
filter(year >= 1952)
ggplot(gapminder_gt1952, aes(x = gdpPercap, y = lifeExp,
color = continent, size = gdpPercap)) +
geom_point() +
scale_x_log10() +
facet_wrap(~ year)
Boxplots are built with geom_boxplot().
gapminder_1952 <- gapminder %>%
filter(year == 1952)
# Add a title to this graph: "Comparing GDP per capita across continents"
ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
geom_boxplot() +
scale_y_log10() +
labs(title = "Comparing GDP per capita across continents")
Use the summarize verb to summarize grouped variables. Available summary functions include mean, sum, median, min, and max. Save the summarized data to an object for use in visualization.
by_country_year <- gapminder %>%
filter(continent == "Asia") %>%
group_by(country, year) %>%
summarize(medianLifeExp = median(lifeExp),
maxGdpPercap = max(gdpPercap))
ggplot(by_country_year, aes(x = year, y = medianLifeExp,
color = country, size = maxGdpPercap)) +
geom_point()
Get the data frame dimensions (row and column counts) with dim(). Use the which.min() and which.max() functions to find the record with the smallest or largest value of the requested variable.
cars <- read.table("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars_semicolon.csv", sep = ";", header = TRUE)
head(cars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
cars[which.min(cars$mpg), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## 15 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
See the levels of a factor variable with the levels() function.
levels(as.factor(mtcars$am))
## [1] "0" "1"
Recode a variable by placing a condition in the row argument.
#Assign the value of mtcars to the new variable mtcars2
mtcars2 <- mtcars
#Assign the label "high" to mpgcategory where mpg is greater than or equal to 20
mtcars2$mpgcategory[mtcars2$mpg >= 20] <- "high"
#Assign the label "low" to mpgcategory where mpg is less than 20
mtcars2$mpgcategory[mtcars2$mpg < 20] <- "low"
#Assign mpgcategory as factor to mpgfactor
mtcars2$mpgfactor <- as.factor(mtcars2$mpgcategory)
Create a frequency table with table()
table(mtcars$am)
##
## 0 1
## 19 13
Barplot: the y variable is height, the x variable is names.arg.
data <- data.frame(outcome = 0:5, probs = c(0.1, 0.2, 0.3, 0.2, 0.1, 0.1))
# make a bar plot of the probability distribution
barplot(height = data$probs, names.arg = data$outcome)
Create a histogram with hist()
, and boxplot with boxplot
.
# Make a histogram of the carb variable from the mtcars data set. Set the title to "Carburetors"
# arguments to change the y-axis scale to 0 - 20, label the x-axis and colour the bars red
hist(mtcars$carb, main = "Carburetors", ylim = c(0,20), col = "red", xlab = "Number of Carburetors")
# Make a boxplot of qsec
boxplot(mtcars$qsec)
There is no mode function in R! Use a sorted frequency table instead; a small helper is sketched after the output below.
# Produce a sorted frequency table of `carb` from `mtcars`
sort(table(mtcars$carb), decreasing = TRUE)
##
## 2 4 1 3 6 8
## 10 10 7 3 1 1
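A small helper is easy to write, assuming the mode means the most frequent value (ties resolve to the first maximum):
stat_mode <- function(x) {
  tab <- table(x)
  names(tab)[which.max(tab)]
}
stat_mode(mtcars$carb)  # "2" (carb values 2 and 4 tie at 10; the first wins)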
Compute the range from the min() and max() functions, and the interquartile range with quantile() or IQR().
# Minimum value
x <- min(mtcars$mpg)
# Maximum value
y <- max(mtcars$mpg)
# Calculate the range of mpg using x and y
y - x
## [1] 23.5
quantile(mtcars$qsec)
## 0% 25% 50% 75% 100%
## 14.5000 16.8925 17.7100 18.9000 22.9000
# Calculate the interquartile range of qsec
IQR(mtcars$qsec)
## [1] 2.0075
Calculate standard deviation with sd().
sd(mtcars$mpg)
## [1] 6.026948
Checking Assumptions: Recall the four assumptions of OLS: 1) linear relationships between the response variable and each predictor variable, 2) independent predictor variables, 3) normally distributed residuals, and 4) equal residual variances.
Normality: The residuals should be normally distributed. If they are not, the OLS estimators yield confidence intervals that are too wide or too narrow. Test with a normal quantile plot, qqnorm(cog_final$residuals) plus the qqline(cog_final$residuals) reference line (a bow-shaped pattern of deviation indicates non-normality), a histogram hist(cog_final$residuals), or a residuals plot (look for random scatter around 0). Note: sometimes the normality check fails when the linearity assumption does not hold.
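A minimal sketch on an lm() fit (cog_final is not defined in this document, so mpg ~ wt on mtcars stands in):
fit <- lm(mpg ~ wt, data = mtcars)
qqnorm(residuals(fit))
qqline(residuals(fit))
hist(residuals(fit))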
Create a scatterplot of quantitative data with plot
. Create a contingency table of categorical data with table()
. Calculate Pearson’s r with cor(var1,var2)
.
plot(women$weight, women$height, main = "Heights and Weights")
#table(smoking$tobacco,smoking$student)
money <- c(4, 3, 2, 2, 8, 1, 1, 2, 3, 4, 5, 6, 7, 9, 9, 8, 12)
education <- c(3, 4, 6, 9, 3, 3, 1, 2, 1, 4, 5, 7, 10, 8, 7, 6, 9)
# calculate the correlation between X and Y
cor(education,money)
## [1] 0.5846627
# save regression coefficients as object "line"
line <- lm(money ~ education)
# print the regression coefficients
line
##
## Call:
## lm(formula = money ~ education)
##
## Coefficients:
## (Intercept) education
## 1.5744 0.6731
# plot Y and X
plot(education,money, main="My Scatterplot")
# add the regression line
abline(line)
We can use
abline()
to add any line we like, as long as the first argument is the intercept and the second is the slope.
dnorm
returns the normal probability of X=x
when the mean is mean
and standard deviation is sd
. pnorm
returns the cumulative probability at the specified value (quantile) q
. qnorm
returns the value (quantile) q
at the specified cumulative probability (percentile) p
.
# probability of a woman having a hair length of less than 20 centimeters
round(pnorm(20, mean = 25, sd = 5), digits = 2)
## [1] 0.16
round(pnorm((20-25)/5), digits = 2)
## [1] 0.16
# 85th percentile of female hair length
qnorm(.85, mean = 25, sd = 5)
## [1] 30.18217
dbinom
returns the binomial probability of X=x
successes given size
trials and probability of success prob
. pbinom
returns the cumulative probability (percentile) p
at the specified value (quantile) q
. qbinom
returns the value (quantile) q
at the specified cumulative probability (percentile) p
.
# probability of answering 5 of 25 questions correctly when p = .2.
dbinom(x = 5, size = 25, prob = .2)
## [1] 0.1960151
# probability of answering >=5 of 25 questions correctly when p = .2.
pbinom(q = 4, size = 25, prob = .2, lower.tail = FALSE)
## [1] 0.5793257
# calculate the 60th percentile
qbinom(p = .6, size = 25, prob = .2)
## [1] 5
Sample data from a set with the sample()
function.
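A minimal sketch (set a seed for reproducibility; results vary otherwise):
set.seed(42)
sample(1:10, size = 5)                         # 5 values without replacement
sample(c("H", "T"), size = 3, replace = TRUE)  # 3 coin flips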
For loop.
# initialize an empty vector
new_number <- NULL
for (i in 1:10) {
new_number[i] <- i
}
print(new_number)
## [1] 1 2 3 4 5 6 7 8 9 10
url_sales <- 'http://s3.amazonaws.com/assets.datacamp.com/production/course_1294/datasets/sales.csv'
sales <- read.csv(url_sales)
# Inspect data.
dim(sales)
## [1] 5000 46
head(sales)
## X event_id primary_act_id secondary_act_id
## 1 1 abcaf1adb99a935fc661 43f0436b905bfa7c2eec b85143bf51323b72e53c
## 2 2 6c56d7f08c95f2aa453c 1a3e9aecd0617706a794 f53529c5679ea6ca5a48
## 3 3 c7ab4524a121f9d687d2 4b677c3f5bec71eec8d1 b85143bf51323b72e53c
## 4 4 394cb493f893be9b9ed1 b1ccea01ad6ef8522796 b85143bf51323b72e53c
## 5 5 55b5f67e618557929f48 91c03a34b562436efa3c b85143bf51323b72e53c
## 6 6 4f10fd8b9f550352bd56 ac4b847b3fde66f2117e 63814f3d63317f1b56c4
## purch_party_lkup_id
## 1 7dfa56dd7d5956b17587
## 2 4f9e6fc637eaf7b736c2
## 3 6c2545703bd527a7144d
## 4 527d6b1eaffc69ddd882
## 5 8bd62c394a35213bdf52
## 6 3b3a628f83135acd0676
## event_name
## 1 Xfinity Center Mansfield Premier Parking: Florida Georgia Line
## 2 Gorge Camping - dave matthews band - sept 3-7
## 3 Dodge Theatre Adams Street Parking - benise
## 4 Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow
## 5 Premier Parking - motley crue
## 6 Fast Lane Access: Journey
## primary_act_name secondary_act_name
## 1 XFINITY Center Mansfield Premier Parking NULL
## 2 Gorge Camping Dave Matthews Band
## 3 Parking Event NULL
## 4 Gexa Energy Pavilion VIP Parking NULL
## 5 White River Amphitheatre Premier Parking NULL
## 6 Fast Lane Access Journey
## major_cat_name minor_cat_name la_event_type_cat
## 1 MISC PARKING PARKING
## 2 MISC CAMPING INVALID
## 3 MISC PARKING PARKING
## 4 MISC PARKING PARKING
## 5 MISC PARKING PARKING
## 6 MISC SPECIAL ENTRY (UPSELL) UPSELL
## event_disp_name
## 1 Xfinity Center Mansfield Premier Parking: Florida Georgia Line
## 2 Gorge Camping - dave matthews band - sept 3-7
## 3 Dodge Theatre Adams Street Parking - benise
## 4 Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow
## 5 Premier Parking - motley crue
## 6 Fast Lane Access: Journey
## ticket_text
## 1 THIS TICKET IS VALID FOR PARKING ONLY GOOD THIS DAY ONLY PREMIER PARKING PASS XFINITY CENTER,LOTS 4 PM SAT SEP 12 2015 7:30 PM
## 2 %OVERNIGHT C A M P I N G%* * * * * *%GORGE CAMPGROUND%* GOOD THIS DATE ONLY *%SEP 3 - 6, 2009
## 3 ADAMS STREET GARAGE%PARKING FOR 4/21/06 ONLY%DODGE THEATRE PARKING PASS%ENTRANCE ON ADAMS STREET%BENISE%GARAGE OPENS AT 6:00PM
## 4 THIS TICKET IS VALID FOR PARKING ONLY GOOD FOR THIS DATE ONLY VIP PARKING PASS GEXA ENERGY PAVILION FRI SEP 02 2011 7:00 PM
## 5 THIS TICKET IS VALID%FOR PARKING ONLY%GOOD THIS DATE ONLY%PREMIER PARKING PASS%WHITE RIVER AMPHITHEATRE%SAT JUL 30, 2005 6:00PM
## 6 FAST LANE JOURNEY FAST LANE EVENT THIS IS NOT A TICKET SAN MANUEL AMPHITHEATER SAT JUL 21 2012 7:00 PM
## tickets_purchased_qty trans_face_val_amt delivery_type_cd
## 1 1 45 eTicket
## 2 1 75 TicketFast
## 3 1 5 TicketFast
## 4 1 20 Mail
## 5 1 20 Mail
## 6 2 10 TicketFast
## event_date_time event_dt presale_dt onsale_dt
## 1 2015-09-12 23:30:00 2015-09-12 NULL 2015-05-15
## 2 2009-09-05 01:00:00 2009-09-04 NULL 2009-03-13
## 3 2006-04-22 01:30:00 2006-04-21 NULL 2006-02-25
## 4 2011-09-03 00:00:00 2011-09-02 NULL 2011-04-22
## 5 2005-07-31 01:00:00 2005-07-30 2005-03-02 2005-03-04
## 6 2012-07-22 02:00:00 2012-07-21 NULL 2012-04-11
## sales_ord_create_dttm sales_ord_tran_dt print_dt timezn_nm
## 1 2015-09-11 18:17:45 2015-09-11 2015-09-12 EST
## 2 2009-07-06 00:00:00 2009-07-05 2009-09-01 PST
## 3 2006-04-05 00:00:00 2006-04-05 2006-04-05 MST
## 4 2011-07-01 17:38:50 2011-07-01 2011-07-06 CST
## 5 2005-06-18 00:00:00 2005-06-18 2005-06-28 PST
## 6 2012-07-21 17:20:18 2012-07-21 2012-07-21 PST
## venue_city venue_state venue_postal_cd_sgmt_1
## 1 MANSFIELD MASSACHUSETTS 02048
## 2 QUINCY WASHINGTON 98848
## 3 PHOENIX ARIZONA 85003
## 4 DALLAS TEXAS 75210
## 5 AUBURN WASHINGTON 98092
## 6 SAN BERNARDINO CALIFORNIA 92407
## sales_platform_cd print_flg la_valid_tkt_event_flg fin_mkt_nm
## 1 www.concerts.livenation.com T N Boston
## 2 NULL T N Seattle
## 3 NULL T N Arizona
## 4 NULL T N Dallas
## 5 NULL T N Seattle
## 6 www.livenation.com T N Los Angeles
## web_session_cookie_val gndr_cd age_yr income_amt edu_val
## 1 7dfa56dd7d5956b17587 <NA> <NA> <NA> <NA>
## 2 4f9e6fc637eaf7b736c2 <NA> <NA> <NA> <NA>
## 3 6c2545703bd527a7144d <NA> <NA> <NA> <NA>
## 4 527d6b1eaffc69ddd882 <NA> <NA> <NA> <NA>
## 5 8bd62c394a35213bdf52 <NA> <NA> <NA> <NA>
## 6 3b3a628f83135acd0676 <NA> <NA> <NA> <NA>
## edu_1st_indv_val edu_2nd_indv_val adults_in_hh_num married_ind
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA>
## child_present_ind home_owner_ind occpn_val occpn_1st_val occpn_2nd_val
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA>
## dist_to_ven
## 1 NA
## 2 59
## 3 NA
## 4 NA
## 5 NA
## 6 NA
names(sales)
## [1] "X" "event_id"
## [3] "primary_act_id" "secondary_act_id"
## [5] "purch_party_lkup_id" "event_name"
## [7] "primary_act_name" "secondary_act_name"
## [9] "major_cat_name" "minor_cat_name"
## [11] "la_event_type_cat" "event_disp_name"
## [13] "ticket_text" "tickets_purchased_qty"
## [15] "trans_face_val_amt" "delivery_type_cd"
## [17] "event_date_time" "event_dt"
## [19] "presale_dt" "onsale_dt"
## [21] "sales_ord_create_dttm" "sales_ord_tran_dt"
## [23] "print_dt" "timezn_nm"
## [25] "venue_city" "venue_state"
## [27] "venue_postal_cd_sgmt_1" "sales_platform_cd"
## [29] "print_flg" "la_valid_tkt_event_flg"
## [31] "fin_mkt_nm" "web_session_cookie_val"
## [33] "gndr_cd" "age_yr"
## [35] "income_amt" "edu_val"
## [37] "edu_1st_indv_val" "edu_2nd_indv_val"
## [39] "adults_in_hh_num" "married_ind"
## [41] "child_present_ind" "home_owner_ind"
## [43] "occpn_val" "occpn_1st_val"
## [45] "occpn_2nd_val" "dist_to_ven"
# Conclusion: rows are individual purchases, columns are information about each purchase - good!
# Get a feel for the data.
str(sales)
## 'data.frame': 5000 obs. of 46 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ event_id : Factor w/ 3746 levels "00071bfcbb27802045b2",..: 2477 1559 2894 846 1216 1119 244 1276 229 2114 ...
## $ primary_act_id : Factor w/ 709 levels "00166bacddabff148a03",..: 190 85 214 495 405 482 452 405 59 677 ...
## $ secondary_act_id : Factor w/ 535 levels "00a4512e22fe9d3a1350",..: 387 509 387 387 387 185 387 387 513 454 ...
## $ purch_party_lkup_id : Factor w/ 4978 levels "000f44312eae9b7e5cae",..: 2476 1565 2110 1620 2742 1185 422 3888 4591 2974 ...
## $ event_name : Factor w/ 2512 levels "\"\"\"\"\"\"\"\"weird Al\"\"\"\"\"\"\"\" Yankovic - the mandatory world tour",..: 2494 845 451 763 1412 575 1369 1914 175 1869 ...
## $ primary_act_name : Factor w/ 710 levels "3 Doors Down",..: 700 249 462 238 690 205 478 690 268 528 ...
## $ secondary_act_name : Factor w/ 537 levels ".38 Special",..: 329 107 329 329 329 227 329 329 54 368 ...
## $ major_cat_name : Factor w/ 5 levels "ARTS","CONCERTS",..: 4 4 4 4 4 4 4 4 4 2 ...
## $ minor_cat_name : Factor w/ 44 levels "ADULT CONTEMPORARY",..: 30 7 30 30 30 39 30 30 30 16 ...
## $ la_event_type_cat : Factor w/ 7 levels "ARTS","CONCERTS",..: 5 4 5 5 5 7 5 5 5 2 ...
## $ event_disp_name : Factor w/ 2511 levels "\"\"\"\"\"\"\"\"weird Al\"\"\"\"\"\"\"\" Yankovic - the mandatory world tour",..: 2493 844 450 762 1411 574 1368 1913 174 1868 ...
## $ ticket_text : Factor w/ 3746 levels " STYX W/ THE NASHVILLE SYMPHONY ASCEND AMPHITHEATER R"| __truncated__,..: 1375 2019 2292 1452 3147 145 3385 1994 1629 179 ...
## $ tickets_purchased_qty : int 1 1 1 1 1 2 1 1 1 1 ...
## $ trans_face_val_amt : num 45 75 5 20 20 10 30 28 20 25 ...
## $ delivery_type_cd : Factor w/ 7 levels "BWC","eTicket",..: 2 6 6 4 4 6 6 4 6 2 ...
## $ event_date_time : Factor w/ 3178 levels "2005-02-21 00:30:00",..: 2351 1062 171 1353 78 1487 238 1765 1575 1945 ...
## $ event_dt : Factor w/ 1635 levels "2005-02-20","2005-03-19",..: 1301 606 125 788 58 887 161 1056 940 1176 ...
## $ presale_dt : Factor w/ 421 levels "2005-01-28","2005-02-23",..: 421 421 421 421 4 421 421 421 421 421 ...
## $ onsale_dt : Factor w/ 1040 levels "2004-12-11","2005-01-14",..: 906 382 71 522 5 608 110 695 652 837 ...
## $ sales_ord_create_dttm : Factor w/ 3860 levels "2005-01-14 00:00:00",..: 2731 781 157 1039 63 1309 222 1436 1334 2104 ...
## $ sales_ord_tran_dt : Factor w/ 1849 levels "2005-01-14","2005-01-31",..: 1683 803 158 1046 63 1221 223 1314 1238 1556 ...
## $ print_dt : Factor w/ 1867 levels "1900-01-01","2005-01-16",..: 1692 855 185 1070 82 1232 250 1327 1250 1562 ...
## $ timezn_nm : Factor w/ 4 levels "CST","EST","MST",..: 2 4 3 1 4 4 2 4 3 4 ...
## $ venue_city : Factor w/ 199 levels "ABBOTSFORD","AKRON",..: 99 140 132 48 13 155 77 13 132 157 ...
## $ venue_state : Factor w/ 48 levels "ALABAMA","ALBERTA",..: 21 46 3 44 46 6 28 46 3 6 ...
## $ venue_postal_cd_sgmt_1: Factor w/ 312 levels "01608","02035",..: 3 276 219 198 270 244 23 270 221 256 ...
## $ sales_platform_cd : Factor w/ 15 levels "","android.ticketmaster.us",..: 11 10 10 10 10 12 10 11 12 7 ...
## $ print_flg : Factor w/ 2 levels "F ","T ": 2 2 2 2 2 2 2 2 2 2 ...
## $ la_valid_tkt_event_flg: Factor w/ 2 levels "N ","Y ": 1 1 1 1 1 1 1 1 1 2 ...
## $ fin_mkt_nm : Factor w/ 51 levels "Arizona","Atlanta",..: 4 43 1 13 43 23 33 43 1 29 ...
## $ web_session_cookie_val: Factor w/ 4978 levels "000f44312eae9b7e5cae",..: 2476 1565 2110 1620 2742 1185 422 3888 4591 2974 ...
## $ gndr_cd : Factor w/ 3 levels "F","M","NULL": NA NA NA NA NA NA 2 NA NA NA ...
## $ age_yr : Factor w/ 35 levels "18","20","22",..: NA NA NA NA NA NA 6 NA NA NA ...
## $ income_amt : Factor w/ 10 levels "10000","112500",..: NA NA NA NA NA NA 2 NA NA NA ...
## $ edu_val : Factor w/ 4 levels "College","Graduate School",..: NA NA NA NA NA NA 3 NA NA NA ...
## $ edu_1st_indv_val : Factor w/ 4 levels "College","Graduate School",..: NA NA NA NA NA NA 3 NA NA NA ...
## $ edu_2nd_indv_val : Factor w/ 4 levels "College","Graduate School",..: NA NA NA NA NA NA 4 NA NA NA ...
## $ adults_in_hh_num : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA 4 NA NA NA ...
## $ married_ind : Factor w/ 3 levels "0","1","NULL": NA NA NA NA NA NA 1 NA NA NA ...
## $ child_present_ind : Factor w/ 3 levels "0","1","NULL": NA NA NA NA NA NA 2 NA NA NA ...
## $ home_owner_ind : Factor w/ 3 levels "0","1","NULL": NA NA NA NA NA NA 1 NA NA NA ...
## $ occpn_val : Factor w/ 11 levels "Admin Managerial",..: NA NA NA NA NA NA 5 NA NA NA ...
## $ occpn_1st_val : Factor w/ 11 levels "Admin Managerial",..: NA NA NA NA NA NA 3 NA NA NA ...
## $ occpn_2nd_val : Factor w/ 10 levels "Admin Managerial",..: NA NA NA NA NA NA 5 NA NA NA ...
## $ dist_to_ven : int NA 59 NA NA NA NA NA NA NA NA ...
summary(sales)
## X event_id primary_act_id
## Min. : 1 84a260b1bcd31e2e75a7: 13 4b677c3f5bec71eec8d1: 208
## 1st Qu.:1251 6c56d7f08c95f2aa453c: 10 1a3e9aecd0617706a794: 167
## Median :2500 6ce493f24421534b4040: 9 6cdc2e270775b7e2f709: 148
## Mean :2500 24d74ef53592d1e950fc: 8 ac4b847b3fde66f2117e: 143
## 3rd Qu.:3750 b62b844fd17979d24df6: 8 43f0436b905bfa7c2eec: 116
## Max. :5000 b67715ea1653ae26356f: 8 3f510718b680022e6c39: 111
## (Other) :4944 (Other) :4107
## secondary_act_id purch_party_lkup_id
## b85143bf51323b72e53c:3414 4834e7c166768041a7c3: 3
## e2981973281c70939168: 51 08cb715b804edce092c1: 2
## f53529c5679ea6ca5a48: 47 1d407fe16b5ea4b880f2: 2
## 9021d10ae169fed0ebb8: 30 23cd7da8896a31c87453: 2
## 8d74e7609bc261c55a13: 26 27ec6221921b66698dc7: 2
## 7205f93a45b2e20210bf: 25 29ebf9ce8bad4d323f67: 2
## (Other) :1407 (Other) :4987
## event_name
## Beyonce - the formation world tour : 85
## Dave Matthews Band : 42
## Coldplay- A Head Full Of Dreams Tour : 29
## HOUSE OF BLUES PASS THE LINE : 27
## Premier Parking: Dave Matthews Band : 27
## Luke Bryan: Kick The Dust Up Tour 2015: 26
## (Other) :4764
## primary_act_name
## Parking Event : 208
## Gorge Camping : 167
## Vip Fast Lane : 148
## Fast Lane Access : 143
## XFINITY Center Mansfield Premier Parking : 116
## Verizon Wireless Amph. Irvine Premier Parking: 111
## (Other) :4107
## secondary_act_name major_cat_name
## NULL :3414 ARTS : 25
## Sasquatch! Festival: 51 CONCERTS:1998
## Dave Matthews Band : 47 FAMILY : 4
## Randy Houser : 30 MISC :2972
## Panic! At The Disco: 26 SPORTS : 1
## Hunter Hayes : 25
## (Other) :1407
## minor_cat_name la_event_type_cat
## PARKING :2314 ARTS : 104
## ROCK/POP : 721 CONCERTS:1906
## ALTERNATIVE ROCK : 402 FAMILY : 4
## SPECIAL ENTRY (UPSELL): 311 INVALID : 171
## COUNTRY : 238 PARKING :2324
## CAMPING : 158 SPORTS : 1
## (Other) : 856 UPSELL : 490
## event_disp_name
## Beyonce - the formation world tour : 85
## Dave Matthews Band : 42
## Coldplay- A Head Full Of Dreams Tour : 29
## HOUSE OF BLUES PASS THE LINE : 27
## Premier Parking: Dave Matthews Band : 27
## Luke Bryan: Kick The Dust Up Tour 2015: 26
## (Other) :4764
## ticket_text
## %OVERNIGHT C A M P I N G%SASQUATCH!%GORGE CAMPGROUND%GOOD THESE DAYS ONLY%MAY 22 - 25, 2009 : 13
## %OVERNIGHT C A M P I N G%* * * * * *%GORGE CAMPGROUND%* GOOD THIS DATE ONLY *%SEP 3 - 6, 2009 : 10
## LIVE NATION PRESENTS COLDPLAY A HEAD FULL OF DREAMS TOUR AT&T STADIUM ALL TAXES INCLUDED SAT AUG 27 2016 8:00 PM : 9
## LIVE NATION PRESENTS BEYONCE THE FORMATION WORLD TOUR CITI FIELD RAIN OR SHINE TUE JUN 07 2016 6:00PM : 8
## Live Nation Presents BEYONCE The Formation World Tour CENTURYLINK FIELD RAIN OR SHINE WED MAY 18 2016 6:00PM : 8
## LIVE NATION PRESENTS BEYONCE THE FORMATION WORLD TOUR ROSE BOWL, PASADENA, CA RAIN OR SHINE SAT MAY 14 2016 6PM : 8
## (Other) :4944
## tickets_purchased_qty trans_face_val_amt delivery_type_cd
## Min. :1.000 Min. : 1.00 BWC : 75
## 1st Qu.:1.000 1st Qu.: 20.00 eTicket :1301
## Median :1.000 Median : 30.00 ISPU : 53
## Mean :1.639 Mean : 77.08 Mail :1504
## 3rd Qu.:2.000 3rd Qu.: 85.00 Paperless : 13
## Max. :8.000 Max. :1520.88 TicketFast:1893
## UPS : 161
## event_date_time event_dt presale_dt
## 2009-05-23 19:00:00: 17 2008-08-22: 18 NULL :2892
## 2009-09-05 01:00:00: 12 2008-05-24: 17 2016-02-09: 71
## 2016-08-07 00:00:00: 10 2009-05-23: 17 2016-02-03: 40
## 2016-08-28 01:00:00: 9 2015-07-18: 17 2016-01-19: 39
## 2008-05-24 19:00:00: 8 2016-05-14: 17 2016-02-15: 33
## 2008-07-17 02:00:00: 8 2008-08-02: 16 2016-02-16: 32
## (Other) :4936 (Other) :4898 (Other) :1893
## onsale_dt sales_ord_create_dttm sales_ord_tran_dt
## NULL : 101 2006-04-08 00:00:00: 19 2016-01-29: 51
## 2016-02-05: 82 2006-05-06 00:00:00: 19 2016-02-09: 49
## 2016-01-22: 61 2006-05-05 00:00:00: 15 2016-02-19: 45
## 2015-04-24: 55 2008-03-10 00:00:00: 14 2016-02-12: 31
## 2016-02-16: 55 2005-04-02 00:00:00: 12 2016-02-15: 30
## 2016-02-19: 54 2007-04-21 00:00:00: 12 2016-02-05: 29
## (Other) :4592 (Other) :4909 (Other) :4765
## print_dt timezn_nm venue_city venue_state
## NULL : 424 CST:1175 PHOENIX : 213 CALIFORNIA : 712
## 1900-01-01: 35 EST:2353 ATLANTA : 210 NEW YORK : 381
## 2016-02-05: 23 MST: 285 CHARLOTTE : 171 WASHINGTON : 347
## 2016-02-12: 20 PST:1187 IRVINE : 152 INDIANA : 296
## 2016-02-23: 20 MANSFIELD : 150 NORTH CAROLINA: 293
## 2016-02-26: 20 INDIANAPOLIS: 149 TEXAS : 270
## (Other) :4458 (Other) :3955 (Other) :2701
## venue_postal_cd_sgmt_1 sales_platform_cd print_flg
## 98848 : 245 NULL :2421 F : 459
## 92618 : 152 www.concerts.livenation.com:1097 T :4541
## 02048 : 150 www.ticketmaster.com : 688
## 46204 : 148 mobile.livenation.us : 198
## 46060 : 147 iphone.ticketmaster.us : 173
## 30303 : 144 mobile.ticketmaster.us : 122
## (Other):4014 (Other) : 301
## la_valid_tkt_event_flg fin_mkt_nm
## N :2985 New York : 462
## Y :2015 Boston : 381
## Seattle : 339
## Los Angeles : 332
## Indiana-Ohio : 295
## N. California: 283
## (Other) :2908
## web_session_cookie_val gndr_cd age_yr income_amt
## 4834e7c166768041a7c3: 3 F : 92 NULL : 47 NULL : 61
## 08cb715b804edce092c1: 2 M : 85 24 : 13 62500 : 41
## 1d407fe16b5ea4b880f2: 2 NULL: 38 34 : 10 200000 : 28
## 23cd7da8896a31c87453: 2 NA's:4785 30 : 9 87500 : 28
## 27ec6221921b66698dc7: 2 44 : 9 45000 : 19
## 29ebf9ce8bad4d323f67: 2 (Other): 127 (Other): 38
## (Other) :4987 NA's :4785 NA's :4785
## edu_val edu_1st_indv_val edu_2nd_indv_val
## College : 41 College : 36 College : 27
## Graduate School: 25 Graduate School: 18 Graduate School: 12
## High School : 73 High School : 58 High School : 47
## NULL : 76 NULL : 103 NULL : 129
## NA's :4785 NA's :4785 NA's :4785
##
##
## adults_in_hh_num married_ind child_present_ind home_owner_ind
## 1 : 52 0 : 52 0 : 57 0 : 8
## 2 : 48 1 : 105 1 : 73 1 : 130
## NULL : 39 NULL: 58 NULL: 85 NULL: 77
## 3 : 30 NA's:4785 NA's:4785 NA's:4785
## 4 : 25
## (Other): 21
## NA's :4785
## occpn_val occpn_1st_val
## NULL : 136 NULL : 138
## Professional Technical: 27 Professional Technical: 29
## Clerical White Collar : 14 Admin Managerial : 11
## Craftsman Blue Collar : 13 Clerical White Collar : 10
## Homemaker : 8 Craftsman Blue Collar : 10
## (Other) : 17 (Other) : 17
## NA's :4785 NA's :4785
## occpn_2nd_val dist_to_ven
## NULL : 152 Min. : 0.0
## Professional Technical: 21 1st Qu.: 12.0
## Clerical WhiteCollar : 13 Median : 26.0
## Homemaker : 12 Mean : 158.2
## Craftsman BlueCollar : 7 3rd Qu.: 77.5
## (Other) : 10 Max. :2548.0
## NA's :4785 NA's :4677
library(dplyr)
glimpse(sales)
## Observations: 5,000
## Variables: 46
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
## $ event_id <fct> abcaf1adb99a935fc661, 6c56d7f08c95f2aa4...
## $ primary_act_id <fct> 43f0436b905bfa7c2eec, 1a3e9aecd0617706a...
## $ secondary_act_id <fct> b85143bf51323b72e53c, f53529c5679ea6ca5...
## $ purch_party_lkup_id <fct> 7dfa56dd7d5956b17587, 4f9e6fc637eaf7b73...
## $ event_name <fct> Xfinity Center Mansfield Premier Parkin...
## $ primary_act_name <fct> XFINITY Center Mansfield Premier Parkin...
## $ secondary_act_name <fct> NULL, Dave Matthews Band, NULL, NULL, N...
## $ major_cat_name <fct> MISC, MISC, MISC, MISC, MISC, MISC, MIS...
## $ minor_cat_name <fct> PARKING, CAMPING, PARKING, PARKING, PAR...
## $ la_event_type_cat <fct> PARKING, INVALID, PARKING, PARKING, PAR...
## $ event_disp_name <fct> Xfinity Center Mansfield Premier Parkin...
## $ ticket_text <fct> THIS TICKET IS VALID FOR PARK...
## $ tickets_purchased_qty <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 4, ...
## $ trans_face_val_amt <dbl> 45, 75, 5, 20, 20, 10, 30, 28, 20, 25, ...
## $ delivery_type_cd <fct> eTicket, TicketFast, TicketFast, Mail, ...
## $ event_date_time <fct> 2015-09-12 23:30:00, 2009-09-05 01:00:0...
## $ event_dt <fct> 2015-09-12, 2009-09-04, 2006-04-21, 201...
## $ presale_dt <fct> NULL, NULL, NULL, NULL, 2005-03-02, NUL...
## $ onsale_dt <fct> 2015-05-15, 2009-03-13, 2006-02-25, 201...
## $ sales_ord_create_dttm <fct> 2015-09-11 18:17:45, 2009-07-06 00:00:0...
## $ sales_ord_tran_dt <fct> 2015-09-11, 2009-07-05, 2006-04-05, 201...
## $ print_dt <fct> 2015-09-12, 2009-09-01, 2006-04-05, 201...
## $ timezn_nm <fct> EST, PST, MST, CST, PST, PST, EST, PST,...
## $ venue_city <fct> MANSFIELD, QUINCY, PHOENIX, DALLAS, AUB...
## $ venue_state <fct> MASSACHUSETTS, WASHINGTON, ARIZONA, TEX...
## $ venue_postal_cd_sgmt_1 <fct> 02048, 98848, 85003, 75210, 98092, 9240...
## $ sales_platform_cd <fct> www.concerts.livenation.com, NULL, NULL...
## $ print_flg <fct> T , T , T , T , T , T , T , T , T , T ,...
## $ la_valid_tkt_event_flg <fct> N , N , N , N , N , N , N , N , N , Y ,...
## $ fin_mkt_nm <fct> Boston, Seattle, Arizona, Dallas, Seatt...
## $ web_session_cookie_val <fct> 7dfa56dd7d5956b17587, 4f9e6fc637eaf7b73...
## $ gndr_cd <fct> NA, NA, NA, NA, NA, NA, M, NA, NA, NA, ...
## $ age_yr <fct> NA, NA, NA, NA, NA, NA, 28, NA, NA, NA,...
## $ income_amt <fct> NA, NA, NA, NA, NA, NA, 112500, NA, NA,...
## $ edu_val <fct> NA, NA, NA, NA, NA, NA, High School, NA...
## $ edu_1st_indv_val <fct> NA, NA, NA, NA, NA, NA, High School, NA...
## $ edu_2nd_indv_val <fct> NA, NA, NA, NA, NA, NA, NULL, NA, NA, N...
## $ adults_in_hh_num <fct> NA, NA, NA, NA, NA, NA, 4, NA, NA, NA, ...
## $ married_ind <fct> NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, ...
## $ child_present_ind <fct> NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, ...
## $ home_owner_ind <fct> NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, ...
## $ occpn_val <fct> NA, NA, NA, NA, NA, NA, NULL, NA, NA, N...
## $ occpn_1st_val <fct> NA, NA, NA, NA, NA, NA, Craftsman Blue ...
## $ occpn_2nd_val <fct> NA, NA, NA, NA, NA, NA, NULL, NA, NA, N...
## $ dist_to_ven <int> NA, 59, NA, NA, NA, NA, NA, NA, NA, NA,...
# Remove the first column (observation number).
sales2 <- sales[,-1]
# Remove the first 4 columns (ID codes) and the last 15 columns (too many NAs).
sales3 <- sales2[,c(5:(ncol(sales2) - 15))]
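# As an aside, since dplyr is already loaded, the same positional selection
# could be written with select(). A minimal sketch (sales3_alt is a
# hypothetical name):
sales3_alt <- select(sales2, 5:(ncol(sales2) - 15))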
# Separate the date-time columns into separate date and time columns.
library(tidyr)
sales4 <- separate(sales3, event_date_time,
into = c("event_dt", "event_time"), sep = " ")
sales5 <- separate(sales4, sales_ord_create_dttm,
into = c("ord_create_dt", "ord_create_time"), sep = " ")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 4 rows
## [2516, 3863, 4082, 4183].
# The second separate() call threw a warning. Inspect the problem records.
issues <- c(2516, 3863, 4082, 4183)
sales3$sales_ord_create_dttm[issues]
## [1] NULL NULL NULL NULL
## 3860 Levels: 2005-01-14 00:00:00 2005-01-31 00:00:00 ... NULL
# For comparison, a well-behaved value of sales_ord_create_dttm.
sales3$sales_ord_create_dttm[2517]
## [1] 2013-08-04 23:07:19
## 3860 Levels: 2005-01-14 00:00:00 2005-01-31 00:00:00 ... NULL
# The problem values are the literal string "NULL", i.e., missing data. These records may need to be dropped.
# Coerce the date strings into dates, using the fact that the date columns have "dt" in their names.
library(stringr)
date_cols <- str_detect(colnames(sales5), "dt")
library(lubridate)
sales5[, date_cols] <- lapply(sales5[,date_cols], ymd)
## Warning: 2892 failed to parse.
## Warning: 101 failed to parse.
## Warning: 4 failed to parse.
## Warning: 424 failed to parse.
# Note the warning messages. Are they due to NAs?
missing <- lapply(sales5[, date_cols], is.na)
sapply(missing, sum)
## event_dt presale_dt onsale_dt ord_create_dt
## 0 2892 101 4
## sales_ord_tran_dt print_dt
## 0 424
# Conclusion: the NA count in each column matches the count in the corresponding warning message, so missing values (the literal "NULL" strings) are the culprit.
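# The parse warnings could have been avoided altogether by recoding the
# literal "NULL" strings to NA before calling ymd(). A sketch, assuming
# sales5_pre (a hypothetical name) is a copy of sales5 taken before the
# ymd() step:
sales5_pre[, date_cols] <- lapply(sales5_pre[, date_cols], function(x) {
  x <- as.character(x)   # factor -> character
  x[x == "NULL"] <- NA   # recode the "NULL" placeholder to NA
  ymd(x)                 # NA inputs parse silently to NA
})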
# Combine the venue_city and venue_state columns into one.
sales6 <- unite(sales5, venue_city_state, venue_city, venue_state, sep = ", ")
library(readxl)
# Read the Excel data file, discarding the first row (a title).
mbta_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1294/datasets/mbta.xlsx"
# read_excel() cannot read directly from the internet, so the file was downloaded to a local drive first.
# The following command downloads the file, but the content arrives unreadable (likely because mode = "wb" was not specified), so it is commented out.
# download.file(mbta_url, file.path("Programs/Data", "mbta.xlsx"))
mbta_path <- file.path("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Programs/Data", "mbta.xlsx")
mbta <- read_excel(mbta_path, skip = 1)
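# For reference, a sketch of a download that keeps the workbook readable:
# download.file() needs mode = "wb" for binary files such as .xlsx on
# Windows. The destination path is just an example.
local_xlsx <- file.path(tempdir(), "mbta.xlsx")
download.file(mbta_url, local_xlsx, mode = "wb")
mbta_check <- read_excel(local_xlsx, skip = 1)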
# Examine organization.
str(mbta)
## Classes 'tbl_df', 'tbl' and 'data.frame': 11 obs. of 60 variables:
## $ X__1 : num 1 2 3 4 5 6 7 8 9 10 ...
## $ mode : chr "All Modes by Qtr" "Boat" "Bus" "Commuter Rail" ...
## $ 2007-01: chr "NA" "4" "335.81900000000002" "142.19999999999999" ...
## $ 2007-02: chr "NA" "3.6" "338.67500000000001" "138.5" ...
## $ 2007-03: num 1188 40 340 138 459 ...
## $ 2007-04: chr "NA" "4.3" "352.16199999999998" "139.5" ...
## $ 2007-05: chr "NA" "4.9000000000000004" "354.36700000000002" "139" ...
## $ 2007-06: num 1246 5.8 350.5 143 477 ...
## $ 2007-07: chr "NA" "6.5209999999999999" "357.51900000000001" "142.39099999999999" ...
## $ 2007-08: chr "NA" "6.5720000000000001" "355.47899999999998" "142.364" ...
## $ 2007-09: num 1256.57 5.47 372.6 143.05 499.57 ...
## $ 2007-10: chr "NA" "5.1449999999999996" "368.84699999999998" "146.542" ...
## $ 2007-11: chr "NA" "3.7629999999999999" "330.82600000000002" "145.089" ...
## $ 2007-12: num 1216.89 2.98 312.92 141.59 448.27 ...
## $ 2008-01: chr "NA" "3.1749999999999998" "340.32400000000001" "142.14500000000001" ...
## $ 2008-02: chr "NA" "3.1110000000000002" "352.90499999999997" "142.607" ...
## $ 2008-03: num 1253.52 3.51 361.15 137.45 494.05 ...
## $ 2008-04: chr "NA" "4.1639999999999997" "368.18900000000002" "140.38900000000001" ...
## $ 2008-05: chr "NA" "4.0149999999999997" "363.90300000000002" "142.58500000000001" ...
## $ 2008-06: num 1314.82 5.19 362.96 142.06 518.35 ...
## $ 2008-07: chr "NA" "6.016" "370.92099999999999" "145.73099999999999" ...
## $ 2008-08: chr "NA" "5.8" "361.05700000000002" "144.565" ...
## $ 2008-09: num 1307.04 4.59 389.54 141.91 517.32 ...
## $ 2008-10: chr "NA" "4.2850000000000001" "357.97399999999999" "151.95699999999999" ...
## $ 2008-11: chr "NA" "3.488" "345.423" "152.952" ...
## $ 2008-12: num 1232.65 3.01 325.77 140.81 446.74 ...
## $ 2009-01: chr "NA" "3.0139999999999998" "338.53199999999998" "141.44800000000001" ...
## $ 2009-02: chr "NA" "3.1960000000000002" "360.41199999999998" "143.529" ...
## $ 2009-03: num 1209.79 3.33 353.69 142.89 467.22 ...
## $ 2009-04: chr "NA" "4.0490000000000004" "359.38" "142.34" ...
## $ 2009-05: chr "NA" "4.1189999999999998" "354.75" "144.22499999999999" ...
## $ 2009-06: num 1233.1 4.9 347.9 142 473.1 ...
## $ 2009-07: chr "NA" "6.444" "339.47699999999998" "137.691" ...
## $ 2009-08: chr "NA" "5.9029999999999996" "332.661" "139.15799999999999" ...
## $ 2009-09: num 1230.5 4.7 374.3 139.1 500.4 ...
## $ 2009-10: chr "NA" "4.2119999999999997" "385.86799999999999" "137.10400000000001" ...
## $ 2009-11: chr "NA" "3.5760000000000001" "366.98" "129.34299999999999" ...
## $ 2009-12: num 1207.85 3.11 332.39 126.07 440.93 ...
## $ 2010-01: chr "NA" "3.2069999999999999" "362.226" "130.91" ...
## $ 2010-02: chr "NA" "3.1949999999999998" "361.13799999999998" "131.91800000000001" ...
## $ 2010-03: num 1208.86 3.48 373.44 131.25 483.4 ...
## $ 2010-04: chr "NA" "4.452" "378.61099999999999" "131.72200000000001" ...
## $ 2010-05: chr "NA" "4.415" "380.17099999999999" "128.80000000000001" ...
## $ 2010-06: num 1244.41 5.41 363.27 129.14 490.26 ...
## $ 2010-07: chr "NA" "6.5129999999999999" "353.04" "122.935" ...
## $ 2010-08: chr "NA" "6.2690000000000001" "343.68799999999999" "129.732" ...
## $ 2010-09: num 1225.5 4.7 381.6 132.9 521.1 ...
## $ 2010-10: chr "NA" "4.4020000000000001" "384.98700000000002" "131.03299999999999" ...
## $ 2010-11: chr "NA" "3.7309999999999999" "367.95499999999998" "130.88900000000001" ...
## $ 2010-12: num 1216.26 3.16 326.34 121.42 450.43 ...
## $ 2011-01: chr "NA" "3.14" "334.95800000000003" "128.39599999999999" ...
## $ 2011-02: chr "NA" "3.2839999999999998" "346.23399999999998" "125.46299999999999" ...
## $ 2011-03: num 1223.45 3.67 380.4 134.37 516.73 ...
## $ 2011-04: chr "NA" "4.2510000000000003" "380.44600000000003" "134.16900000000001" ...
## $ 2011-05: chr "NA" "4.431" "385.28899999999999" "136.13999999999999" ...
## $ 2011-06: num 1302.41 5.47 376.32 135.58 529.53 ...
## $ 2011-07: chr "NA" "6.5810000000000004" "361.58499999999998" "132.41" ...
## $ 2011-08: chr "NA" "6.7329999999999997" "353.79300000000001" "130.61600000000001" ...
## $ 2011-09: num 1291 5 388 137 550 ...
## $ 2011-10: chr "NA" "4.484" "398.45600000000002" "128.72" ...
head(mbta)
## # A tibble: 6 x 60
## X__1 mode `2007-01` `2007-02` `2007-03` `2007-04` `2007-05` `2007-06`
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1. All M~ NA NA 1188. NA NA 1246.
## 2 2. Boat 4 3.6 40.0 4.3 4.900000~ 5.80
## 3 3. Bus 335.8190~ 338.6750~ 340. 352.1619~ 354.3670~ 351.
## 4 4. Commu~ 142.1999~ 138.5 138. 139.5 139 143.
## 5 5. Heavy~ 435.2939~ 448.2710~ 459. 472.2010~ 474.5790~ 477.
## 6 6. Light~ 227.2309~ 240.262 241. 255.5569~ 248.262 246.
## # ... with 52 more variables: `2007-07` <chr>, `2007-08` <chr>,
## # `2007-09` <dbl>, `2007-10` <chr>, `2007-11` <chr>, `2007-12` <dbl>,
## # `2008-01` <chr>, `2008-02` <chr>, `2008-03` <dbl>, `2008-04` <chr>,
## # `2008-05` <chr>, `2008-06` <dbl>, `2008-07` <chr>, `2008-08` <chr>,
## # `2008-09` <dbl>, `2008-10` <chr>, `2008-11` <chr>, `2008-12` <dbl>,
## # `2009-01` <chr>, `2009-02` <chr>, `2009-03` <dbl>, `2009-04` <chr>,
## # `2009-05` <chr>, `2009-06` <dbl>, `2009-07` <chr>, `2009-08` <chr>,
## # `2009-09` <dbl>, `2009-10` <chr>, `2009-11` <chr>, `2009-12` <dbl>,
## # `2010-01` <chr>, `2010-02` <chr>, `2010-03` <dbl>, `2010-04` <chr>,
## # `2010-05` <chr>, `2010-06` <dbl>, `2010-07` <chr>, `2010-08` <chr>,
## # `2010-09` <dbl>, `2010-10` <chr>, `2010-11` <chr>, `2010-12` <dbl>,
## # `2011-01` <chr>, `2011-02` <chr>, `2011-03` <dbl>, `2011-04` <chr>,
## # `2011-05` <chr>, `2011-06` <dbl>, `2011-07` <chr>, `2011-08` <chr>,
## # `2011-09` <dbl>, `2011-10` <chr>
summary(mbta)
## X__1 mode 2007-01 2007-02
## Min. : 1.0 Length:11 Length:11 Length:11
## 1st Qu.: 3.5 Class :character Class :character Class :character
## Median : 6.0 Mode :character Mode :character Mode :character
## Mean : 6.0
## 3rd Qu.: 8.5
## Max. :11.0
## 2007-03 2007-04 2007-05
## Min. : 0.114 Length:11 Length:11
## 1st Qu.: 9.278 Class :character Class :character
## Median : 137.700 Mode :character Mode :character
## Mean : 330.293
## 3rd Qu.: 399.225
## Max. :1204.725
## 2007-06 2007-07 2007-08
## Min. : 0.096 Length:11 Length:11
## 1st Qu.: 5.700 Class :character Class :character
## Median : 143.000 Mode :character Mode :character
## Mean : 339.846
## 3rd Qu.: 413.788
## Max. :1246.129
## 2007-09 2007-10 2007-11
## Min. : -0.007 Length:11 Length:11
## 1st Qu.: 5.539 Class :character Class :character
## Median : 143.051 Mode :character Mode :character
## Mean : 352.554
## 3rd Qu.: 436.082
## Max. :1310.764
## 2007-12 2008-01 2008-02
## Min. : -0.060 Length:11 Length:11
## 1st Qu.: 4.385 Class :character Class :character
## Median : 141.585 Mode :character Mode :character
## Mean : 321.588
## 3rd Qu.: 380.594
## Max. :1216.890
## 2008-03 2008-04 2008-05
## Min. : 0.058 Length:11 Length:11
## 1st Qu.: 5.170 Class :character Class :character
## Median : 137.453 Mode :character Mode :character
## Mean : 345.604
## 3rd Qu.: 427.601
## Max. :1274.031
## 2008-06 2008-07 2008-08
## Min. : 0.060 Length:11 Length:11
## 1st Qu.: 5.742 Class :character Class :character
## Median : 142.057 Mode :character Mode :character
## Mean : 359.667
## 3rd Qu.: 440.656
## Max. :1320.728
## 2008-09 2008-10 2008-11
## Min. : 0.021 Length:11 Length:11
## 1st Qu.: 5.691 Class :character Class :character
## Median : 141.907 Mode :character Mode :character
## Mean : 362.099
## 3rd Qu.: 453.430
## Max. :1338.015
## 2008-12 2009-01 2009-02
## Min. : -0.015 Length:11 Length:11
## 1st Qu.: 4.689 Class :character Class :character
## Median : 140.810 Mode :character Mode :character
## Mean : 319.882
## 3rd Qu.: 386.255
## Max. :1232.655
## 2009-03 2009-04 2009-05
## Min. : -0.050 Length:11 Length:11
## 1st Qu.: 5.003 Class :character Class :character
## Median : 142.893 Mode :character Mode :character
## Mean : 330.142
## 3rd Qu.: 410.455
## Max. :1210.912
## 2009-06 2009-07 2009-08
## Min. : -0.079 Length:11 Length:11
## 1st Qu.: 5.845 Class :character Class :character
## Median : 142.006 Mode :character Mode :character
## Mean : 333.194
## 3rd Qu.: 410.482
## Max. :1233.085
## 2009-09 2009-10 2009-11
## Min. : -0.035 Length:11 Length:11
## 1st Qu.: 5.693 Class :character Class :character
## Median : 139.087 Mode :character Mode :character
## Mean : 346.687
## 3rd Qu.: 437.332
## Max. :1291.564
## 2009-12 2010-01 2010-02
## Min. : -0.022 Length:11 Length:11
## 1st Qu.: 4.784 Class :character Class :character
## Median : 126.066 Mode :character Mode :character
## Mean : 312.962
## 3rd Qu.: 386.659
## Max. :1207.845
## 2010-03 2010-04 2010-05
## Min. : 0.012 Length:11 Length:11
## 1st Qu.: 5.274 Class :character Class :character
## Median : 131.252 Mode :character Mode :character
## Mean : 332.726
## 3rd Qu.: 428.420
## Max. :1225.556
## 2010-06 2010-07 2010-08
## Min. : 0.008 Length:11 Length:11
## 1st Qu.: 6.436 Class :character Class :character
## Median : 129.144 Mode :character Mode :character
## Mean : 335.964
## 3rd Qu.: 426.769
## Max. :1244.409
## 2010-09 2010-10 2010-11
## Min. : 0.001 Length:11 Length:11
## 1st Qu.: 5.567 Class :character Class :character
## Median : 132.892 Mode :character Mode :character
## Mean : 346.524
## 3rd Qu.: 451.361
## Max. :1293.117
## 2010-12 2011-01 2011-02
## Min. : -0.004 Length:11 Length:11
## 1st Qu.: 4.466 Class :character Class :character
## Median : 121.422 Mode :character Mode :character
## Mean : 312.917
## 3rd Qu.: 388.385
## Max. :1216.262
## 2011-03 2011-04 2011-05
## Min. : 0.05 Length:11 Length:11
## 1st Qu.: 6.03 Class :character Class :character
## Median : 134.37 Mode :character Mode :character
## Mean : 345.17
## 3rd Qu.: 448.56
## Max. :1286.66
## 2011-06 2011-07 2011-08
## Min. : 0.054 Length:11 Length:11
## 1st Qu.: 6.926 Class :character Class :character
## Median : 135.581 Mode :character Mode :character
## Mean : 353.331
## 3rd Qu.: 452.923
## Max. :1302.414
## 2011-09 2011-10
## Min. : 0.043 Length:11
## 1st Qu.: 6.660 Class :character
## Median : 136.901 Mode :character
## Mean : 362.555
## 3rd Qu.: 469.204
## Max. :1348.754
# Conclusion: observations are stored as columns rather than as rows.
# Need to remove rows 1, 7, and 11 (All Modes by Qtr, Pct Chg / Yr, and TOTAL).
# Need to remove column 1 (row number).
# Gather the month columns (yyyy-mm) into key-value pairs.
# Spread the modes into columns.
mbta2 <- mbta[-c(1,7,11),]
mbta3 <- mbta2[,-1]
library(tidyr)
mbta4 <- gather(mbta3, month, thou_riders, -c(mode))
mbta4$thou_riders <- as.numeric(mbta4$thou_riders)
mbta5 <- spread(mbta4, mode, thou_riders)
mbta6 <- separate(mbta5, col = "month", into = c("year", "month"), sep = "-")
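# gather() and spread() are superseded in newer tidyr (>= 1.1) by
# pivot_longer() and pivot_wider(). A sketch of the same reshaping with the
# newer verbs; because the month columns mix character and numeric types,
# coerce to character while lengthening, then convert back.
mbta4_alt <- pivot_longer(mbta3, cols = -mode,
                          names_to = "month", values_to = "thou_riders",
                          values_transform = list(thou_riders = as.character))
mbta4_alt$thou_riders <- as.numeric(mbta4_alt$thou_riders)
mbta5_alt <- pivot_wider(mbta4_alt, names_from = mode, values_from = thou_riders)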
# Screen for obvious mistakes and/or outliers.
summary(mbta6)
## year month Boat Bus
## Length:58 Length:58 Min. : 2.985 Min. :312.9
## Class :character Class :character 1st Qu.: 3.494 1st Qu.:345.6
## Mode :character Mode :character Median : 4.293 Median :359.9
## Mean : 5.068 Mean :358.6
## 3rd Qu.: 5.356 3rd Qu.:372.2
## Max. :40.000 Max. :398.5
## Commuter Rail Heavy Rail Light Rail Private Bus
## Min. :121.4 Min. :435.3 Min. :194.4 Min. :2.213
## 1st Qu.:131.4 1st Qu.:471.1 1st Qu.:220.6 1st Qu.:2.641
## Median :138.8 Median :487.3 Median :231.9 Median :2.820
## Mean :137.4 Mean :489.3 Mean :233.0 Mean :3.352
## 3rd Qu.:142.4 3rd Qu.:511.3 3rd Qu.:244.5 3rd Qu.:4.167
## Max. :153.0 Max. :554.9 Max. :271.1 Max. :4.878
## RIDE Trackless Trolley
## Min. :4.900 Min. : 5.777
## 1st Qu.:5.965 1st Qu.:11.679
## Median :6.615 Median :12.598
## Mean :6.604 Mean :12.125
## 3rd Qu.:7.149 3rd Qu.:13.320
## Max. :8.598 Max. :15.109
# The Boat column looks suspicious: a maximum of 40 against a third quartile of about 5.4 suggests the 40 is a typo for 4.
hist(mbta6$Boat)
i <- which(mbta6$Boat == 40)
mbta6$Boat[i] <- 4
hist(mbta6$Boat)
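# The 40 was caught by eye; a sketch of flagging such values programmatically
# with the boxplot whisker rule (run on the uncorrected column this flags the
# 40; after the fix it returns nothing).
out_vals <- boxplot.stats(mbta6$Boat)$out
which(mbta6$Boat %in% out_vals)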
mbta_boat <- mbta4 %>% filter(mode == "Boat" | mode == "Trackless Trolley")
# Plot Boat and Trackless Trolley ridership over time.
library(ggplot2)
ggplot(mbta_boat, aes(x = month, y = thou_riders, col = mode)) + geom_point() +
scale_x_discrete(name = "Month", breaks = c("2007-01", "2008-01", "2009-01", "2010-01", "2011-01")) +
scale_y_continuous(name = "Avg Weekday Ridership (thousands)")
# Plot ridership across all T modes over time.
ggplot(mbta4, aes(x = month, y = thou_riders, col = mode)) + geom_point() +
scale_x_discrete(name = "Month", breaks = c("2007-01", "2008-01", "2009-01", "2010-01", "2011-01")) +
scale_y_continuous(name = "Avg Weekday Ridership (thousands)")
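# The month values are strings like "2007-01", so the discrete x-axis above
# sorts them lexically. A sketch of plotting on a true date axis instead:
# append a day so ymd() (lubridate is already loaded) can parse the strings.
mbta4$month_date <- ymd(paste0(mbta4$month, "-01"))
ggplot(mbta4, aes(x = month_date, y = thou_riders, col = mode)) +
  geom_point() +
  scale_x_date(name = "Month") +
  scale_y_continuous(name = "Avg Weekday Ridership (thousands)")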
# Read data. For large data sets, use fread() from the data.table package.
library(data.table)
food_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1294/datasets/food.csv"
food <- fread(food_url, data.table = FALSE)
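# fread() can also load just a subset of a very wide file via its select
# argument. A sketch reading three columns of interest (names taken from the
# str() output below):
food_small <- fread(food_url, data.table = FALSE,
                    select = c("code", "product_name", "sugars_100g"))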
# Examine organization.
str(food)
## 'data.frame': 1500 obs. of 160 variables:
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ code : int 100030 100050 100079 100094 100124 100136 100194 100221 100257 100258 ...
## $ url : chr "http://world-en.openfoodfacts.org/product/3222475745867/confiture-de-fraise-fraise-des-bois-au-sucre-de-canne-casino-delices" "http://world-en.openfoodfacts.org/product/5410976880110/guylian-sea-shells-selection" "http://world-en.openfoodfacts.org/product/3264750423503/pates-de-fruits-aromatisees-jacquot" "http://world-en.openfoodfacts.org/product/8006040247001/nata-vegetal-a-base-de-soja-valsoia" ...
## $ creator : chr "sebleouf" "foodorigins" "domdom26" "javichu" ...
## $ created_t : int 1424747544 1450316429 1428674916 1420416591 1420501121 1437983923 1442420988 1435686217 1436991777 1400516512 ...
## $ created_datetime : chr "2015-02-24T03:12:24Z" "2015-12-17T01:40:29Z" "2015-04-10T14:08:36Z" "2015-01-05T00:09:51Z" ...
## $ last_modified_t : int 1438445887 1450817956 1428739289 1420417876 1445700917 1445577476 1442420988 1451405288 1436991779 1437236856 ...
## $ last_modified_datetime : chr "2015-08-01T16:18:07Z" "2015-12-22T20:59:16Z" "2015-04-11T08:01:29Z" "2015-01-05T00:31:16Z" ...
## $ product_name : chr "Confiture de fraise fraise des bois au sucre de canne" "Guylian Sea Shells Selection" "Pâtes de fruits aromatisées" "Nata vegetal a base de soja "Valsoia"" ...
## $ generic_name : chr "" "" "Pâtes de fruits" "Nata vegetal a base de soja" ...
## $ quantity : chr "265 g" "375g" "1 kg" "200 ml" ...
## $ packaging : chr "Bocal,Verre" "Plastic,Box" "Carton,plastique" "Tetra Brik" ...
## $ packaging_tags : chr "bocal,verre" "plastic,box" "carton,plastique" "tetra-brik" ...
## $ brands : chr "Casino Délices" "Guylian" "Jacquot" "Valsoia,//Propiedad de://,Valsoia S.p.A." ...
## $ brands_tags : chr "casino-delices" "guylian" "jacquot" "valsoia,propiedad-de,valsoia-s-p-a" ...
## $ categories : chr "Aliments et boissons à base de végétaux,Aliments d'origine végétale,Aliments à base de fruits et de légu"| __truncated__ "Chocolate" "pâtes de fruits" "Alimentos y bebidas de origen vegetal,Alimentos de origen vegetal,Natas vegetales,Natas vegetales a base de soj"| __truncated__ ...
## $ categories_tags : chr "en:plant-based-foods-and-beverages,en:plant-based-foods,en:fruits-and-vegetables-based-foods,en:breakfasts,en:s"| __truncated__ "en:sugary-snacks,en:chocolates" "en:plant-based-foods-and-beverages,en:plant-based-foods,en:fruits-and-vegetables-based-foods,en:sugary-snacks,e"| __truncated__ "en:plant-based-foods-and-beverages,en:plant-based-foods,en:plant-based-creams,en:plant-based-creams-for-cooking"| __truncated__ ...
## $ categories_en : chr "Plant-based foods and beverages,Plant-based foods,Fruits and vegetables based foods,Breakfasts,Spreads,Fruits b"| __truncated__ "Sugary snacks,Chocolates" "Plant-based foods and beverages,Plant-based foods,Fruits and vegetables based foods,Sugary snacks,Confectioneri"| __truncated__ "Plant-based foods and beverages,Plant-based foods,Plant-based creams,Plant-based creams for cooking,Soy-based c"| __truncated__ ...
## $ origins : chr "" "" "" "" ...
## $ origins_tags : chr "" "" "" "" ...
## $ manufacturing_places : chr "France" "Belgium" "" "Italia" ...
## $ manufacturing_places_tags : chr "france" "belgium" "" "italia" ...
## $ labels : chr "" "" "" "Vegetariano,Vegano,Sin gluten,Sin OMG,Sin lactosa" ...
## $ labels_tags : chr "" "" "" "en:vegetarian,en:vegan,en:gluten-free,en:no-gmos,en:no-lactose" ...
## $ labels_en : chr "" "" "" "Vegetarian,Vegan,Gluten-free,No GMOs,No lactose" ...
## $ emb_codes : chr "EMB 78015" "" "" "" ...
## $ emb_codes_tags : chr "emb-78015" "" "" "" ...
## $ first_packaging_code_geo : chr "48.983333,2.066667" "" "" "" ...
## $ cities : logi NA NA NA NA NA NA ...
## $ cities_tags : chr "andresy-yvelines-france" "" "" "" ...
## $ purchase_places : chr "Lyon,France" "NSW,Australia" "France" "Madrid,España" ...
## $ stores : chr "Casino" "" "" "El Corte Inglés" ...
## $ countries : chr "France" "Australia" "France" "España" ...
## $ countries_tags : chr "en:france" "en:australia" "en:france" "en:spain" ...
## $ countries_en : chr "France" "Australia" "France" "Spain" ...
## $ ingredients_text : chr "Sucre de canne, fraises 40 g, fraises des bois 14 g, gélifiant : pectines de fruits, jus de citron concentré."| __truncated__ "" "Pulpe de pommes 50% , sucre, sirop de glucose, gélifiant : pectine, acidifiant : acide citrique, arômes, colo"| __truncated__ "Extracto de soja (78%) (agua, semillas de soja 8,3%), grasas vegetales, jarabe de glucosa, dextrosa, emulsionan"| __truncated__ ...
## $ allergens : chr "" "" "" "" ...
## $ allergens_en : logi NA NA NA NA NA NA ...
## $ traces : chr "Lait,Fruits à coque" "" "" "" ...
## $ traces_tags : chr "en:milk,en:nuts" "" "" "" ...
## $ traces_en : chr "Milk,Nuts" "" "" "" ...
## $ serving_size : chr "15 g" "" "" "" ...
## $ no_nutriments : logi NA NA NA NA NA NA ...
## $ additives_n : int 1 NA 2 5 0 NA NA 0 NA 1 ...
## $ additives : chr "[ sucre-de-canne -> fr:sucre-de-canne ] [ sucre-de -> fr:sucre-de ] [ sucre -> fr:sucre ] [ fraises-40-g "| __truncated__ "" "[ pulpe-de-pommes-50 -> fr:pulpe-de-pommes-50 ] [ pulpe-de-pommes -> fr:pulpe-de-pommes ] [ pulpe-de -> fr:"| __truncated__ "[ extracto-de-soja -> es:extracto-de-soja ] [ 78 -> es:78 ] [ agua -> es:agua ] [ semillas-de-soja-8 -> e"| __truncated__ ...
## $ additives_tags : chr "en:e440" "" "en:e440,en:e330" "en:e471,en:e415,en:e407,en:e412,en:e306" ...
## $ additives_en : chr "E440 - Pectins" "" "E440 - Pectins,E330 - Citric acid" "E471 - Mono- and diglycerides of fatty acids,E415 - Xanthan gum,E407 - Carrageenan,E412 - Guar gum,E306 - Tocop"| __truncated__ ...
## $ ingredients_from_palm_oil_n : int 0 NA 0 0 0 NA NA 0 NA 0 ...
## $ ingredients_from_palm_oil : logi NA NA NA NA NA NA ...
## $ ingredients_from_palm_oil_tags : chr "" "" "" "" ...
## $ ingredients_that_may_be_from_palm_oil_n : int 0 NA 0 1 0 NA NA 0 NA 0 ...
## $ ingredients_that_may_be_from_palm_oil : logi NA NA NA NA NA NA ...
## $ ingredients_that_may_be_from_palm_oil_tags: chr "" "" "" "e471-mono-et-diglycerides-d-acides-gras-alimentaires" ...
## $ nutrition_grade_uk : logi NA NA NA NA NA NA ...
## $ nutrition_grade_fr : chr "d" "" "" "d" ...
## $ pnns_groups_1 : chr "Sugary snacks" "Sugary snacks" "Fruits and vegetables" "unknown" ...
## $ pnns_groups_2 : chr "Sweets" "Chocolate products" "Fruits" "unknown" ...
## $ states : chr "en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-to-be"| __truncated__ "en:to-be-completed, en:nutrition-facts-to-be-completed, en:ingredients-to-be-completed, en:expiration-date-to-b"| __truncated__ "en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-to-be"| __truncated__ "en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-compl"| __truncated__ ...
## $ states_tags : chr "en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-to-be-com"| __truncated__ "en:to-be-completed,en:nutrition-facts-to-be-completed,en:ingredients-to-be-completed,en:expiration-date-to-be-c"| __truncated__ "en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-to-be-com"| __truncated__ "en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-completed"| __truncated__ ...
## $ states_en : chr "To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date to be completed,Characte"| __truncated__ "To be completed,Nutrition facts to be completed,Ingredients to be completed,Expiration date to be completed,Cha"| __truncated__ "To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date to be completed,Characte"| __truncated__ "To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date completed,Characteristic"| __truncated__ ...
## $ main_category : chr "en:plant-based-foods-and-beverages" "en:sugary-snacks" "en:plant-based-foods-and-beverages" "en:plant-based-foods-and-beverages" ...
## $ main_category_en : chr "Plant-based foods and beverages" "Sugary snacks" "Plant-based foods and beverages" "Plant-based foods and beverages" ...
## $ image_url : chr "http://en.openfoodfacts.org/images/products/322/247/574/5867/front.8.400.jpg" "http://en.openfoodfacts.org/images/products/541/097/688/0110/front.7.400.jpg" "http://en.openfoodfacts.org/images/products/326/475/042/3503/front.6.400.jpg" "http://en.openfoodfacts.org/images/products/800/604/024/7001/front.7.400.jpg" ...
## $ image_small_url : chr "http://en.openfoodfacts.org/images/products/322/247/574/5867/front.8.200.jpg" "http://en.openfoodfacts.org/images/products/541/097/688/0110/front.7.200.jpg" "http://en.openfoodfacts.org/images/products/326/475/042/3503/front.6.200.jpg" "http://en.openfoodfacts.org/images/products/800/604/024/7001/front.7.200.jpg" ...
## $ energy_100g : num 918 NA NA 766 2359 ...
## $ energy_from_fat_100g : num NA NA NA NA NA NA NA NA NA NA ...
## $ fat_100g : num 0 NA NA 16.7 45.5 NA NA 25 NA 4 ...
## $ saturated_fat_100g : num 0 NA NA 9.9 5.2 NA NA 17 NA 0.54 ...
## $ butyric_acid_100g : logi NA NA NA NA NA NA ...
## $ caproic_acid_100g : logi NA NA NA NA NA NA ...
## $ caprylic_acid_100g : logi NA NA NA NA NA NA ...
## $ capric_acid_100g : logi NA NA NA NA NA NA ...
## $ lauric_acid_100g : logi NA NA NA NA NA NA ...
## $ myristic_acid_100g : logi NA NA NA NA NA NA ...
## $ palmitic_acid_100g : logi NA NA NA NA NA NA ...
## $ stearic_acid_100g : logi NA NA NA NA NA NA ...
## $ arachidic_acid_100g : logi NA NA NA NA NA NA ...
## $ behenic_acid_100g : logi NA NA NA NA NA NA ...
## $ lignoceric_acid_100g : logi NA NA NA NA NA NA ...
## $ cerotic_acid_100g : logi NA NA NA NA NA NA ...
## $ montanic_acid_100g : logi NA NA NA NA NA NA ...
## $ melissic_acid_100g : logi NA NA NA NA NA NA ...
## $ monounsaturated_fat_100g : num NA NA NA 2.9 9.5 NA NA NA NA NA ...
## $ polyunsaturated_fat_100g : num NA NA NA 3.9 32.8 NA NA NA NA NA ...
## $ omega_3_fat_100g : num NA NA NA NA NA NA NA NA NA NA ...
## $ alpha_linolenic_acid_100g : num NA NA NA NA NA NA NA NA NA NA ...
## $ eicosapentaenoic_acid_100g : num NA NA NA NA NA NA NA NA NA NA ...
## $ docosahexaenoic_acid_100g : num NA NA NA NA NA NA NA NA NA NA ...
## $ omega_6_fat_100g : num NA NA NA NA NA NA NA NA NA NA ...
## $ linoleic_acid_100g : num NA NA NA NA NA NA NA NA NA NA ...
## $ arachidonic_acid_100g : logi NA NA NA NA NA NA ...
## $ gamma_linolenic_acid_100g : logi NA NA NA NA NA NA ...
## $ dihomo_gamma_linolenic_acid_100g : logi NA NA NA NA NA NA ...
## $ omega_9_fat_100g : logi NA NA NA NA NA NA ...
## $ oleic_acid_100g : logi NA NA NA NA NA NA ...
## $ elaidic_acid_100g : logi NA NA NA NA NA NA ...
## $ gondoic_acid_100g : logi NA NA NA NA NA NA ...
## $ mead_acid_100g : logi NA NA NA NA NA NA ...
## $ erucic_acid_100g : logi NA NA NA NA NA NA ...
## [list output truncated]
head(food)
## V1 code
## 1 1 100030
## 2 2 100050
## 3 3 100079
## 4 4 100094
## 5 5 100124
## 6 6 100136
## url
## 1 http://world-en.openfoodfacts.org/product/3222475745867/confiture-de-fraise-fraise-des-bois-au-sucre-de-canne-casino-delices
## 2 http://world-en.openfoodfacts.org/product/5410976880110/guylian-sea-shells-selection
## 3 http://world-en.openfoodfacts.org/product/3264750423503/pates-de-fruits-aromatisees-jacquot
## 4 http://world-en.openfoodfacts.org/product/8006040247001/nata-vegetal-a-base-de-soja-valsoia
## 5 http://world-en.openfoodfacts.org/product/8480000340764/semillas-de-girasol-con-cascara-tostadas-aguasal-hacendado
## 6 http://world-en.openfoodfacts.org/product/0087703177727/soft-drink
## creator created_t created_datetime last_modified_t
## 1 sebleouf 1424747544 2015-02-24T03:12:24Z 1438445887
## 2 foodorigins 1450316429 2015-12-17T01:40:29Z 1450817956
## 3 domdom26 1428674916 2015-04-10T14:08:36Z 1428739289
## 4 javichu 1420416591 2015-01-05T00:09:51Z 1420417876
## 5 javichu 1420501121 2015-01-05T23:38:41Z 1445700917
## 6 foodorigins 1437983923 2015-07-27T07:58:43Z 1445577476
## last_modified_datetime
## 1 2015-08-01T16:18:07Z
## 2 2015-12-22T20:59:16Z
## 3 2015-04-11T08:01:29Z
## 4 2015-01-05T00:31:16Z
## 5 2015-10-24T15:35:17Z
## 6 2015-10-23T05:17:56Z
## product_name
## 1 Confiture de fraise fraise des bois au sucre de canne
## 2 Guylian Sea Shells Selection
## 3 Pâtes de fruits aromatisées
## 4 Nata vegetal a base de soja "Valsoia"
## 5 Semillas de girasol con cáscara tostadas aguasal
## 6 Soft Drink
## generic_name quantity
## 1 265 g
## 2 375g
## 3 Pâtes de fruits 1 kg
## 4 Nata vegetal a base de soja 200 ml
## 5 Semillas de girasol con cáscara tostadas aguasal 200 g
## 6
## packaging
## 1 Bocal,Verre
## 2 Plastic,Box
## 3 Carton,plastique
## 4 Tetra Brik
## 5 Bolsa de plástico,Envasado en atmósfera protectora
## 6
## packaging_tags
## 1 bocal,verre
## 2 plastic,box
## 3 carton,plastique
## 4 tetra-brik
## 5 bolsa-de-plastico,envasado-en-atmosfera-protectora
## 6
## brands
## 1 Casino Délices
## 2 Guylian
## 3 Jacquot
## 4 Valsoia,//Propiedad de://,Valsoia S.p.A.
## 5 Hacendado,//Propiedad de://,Mercadona S.A.
## 6
## brands_tags
## 1 casino-delices
## 2 guylian
## 3 jacquot
## 4 valsoia,propiedad-de,valsoia-s-p-a
## 5 hacendado,propiedad-de,mercadona-s-a
## 6
## categories
## 1 Aliments et boissons à base de végétaux,Aliments d'origine végétale,Aliments à base de fruits et de légumes,Petit-déjeuners,Produits à tartiner,Fruits et produits dérivés,Pâtes à tartiner végétaux,Produits à tartiner sucrés,Confitures et marmelades,Confitures,Confitures de fruits,Confitures de fruits rouges,Confitures de fraises
## 2 Chocolate
## 3 pâtes de fruits
## 4 Alimentos y bebidas de origen vegetal,Alimentos de origen vegetal,Natas vegetales,Natas vegetales a base de soja para cocinar,Natas vegetales para cocinar
## 5 Semillas de girasol y derivados, Semillas, Semillas de girasol, Semillas de girasol con cáscara, Semillas de girasol tostadas, Semillas de girasol con cáscara tostadas, Semillas de girasol con cáscara tostadas aguasal
## 6
## categories_tags
## 1 en:plant-based-foods-and-beverages,en:plant-based-foods,en:fruits-and-vegetables-based-foods,en:breakfasts,en:spreads,en:fruits-based-foods,en:plant-based-spreads,en:sweet-spreads,en:fruit-preserves,en:jams,en:fruit-jams,en:berry-jams,en:strawberry-jams
## 2 en:sugary-snacks,en:chocolates
## 3 en:plant-based-foods-and-beverages,en:plant-based-foods,en:fruits-and-vegetables-based-foods,en:sugary-snacks,en:confectioneries,en:fruits-based-foods,en:fruit-pastes
## 4 en:plant-based-foods-and-beverages,en:plant-based-foods,en:plant-based-creams,en:plant-based-creams-for-cooking,en:soy-based-creams-for-cooking
## 5 en:plant-based-foods-and-beverages,en:plant-based-foods,en:seeds,en:sunflower-seeds-and-their-products,en:sunflower-seeds,en:roasted-sunflower-seeds,en:unshelled-sunflower-seeds,en:roasted-unshelled-sunflower-seeds,es:semillas-de-girasol-con-cascara-tostadas-aguasal
## 6
## categories_en
## 1 Plant-based foods and beverages,Plant-based foods,Fruits and vegetables based foods,Breakfasts,Spreads,Fruits based foods,Plant-based spreads,Sweet spreads,Fruit preserves,Jams,Fruit jams,Berry jams,Strawberry jams
## 2 Sugary snacks,Chocolates
## 3 Plant-based foods and beverages,Plant-based foods,Fruits and vegetables based foods,Sugary snacks,Confectioneries,Fruits based foods,Fruit pastes
## 4 Plant-based foods and beverages,Plant-based foods,Plant-based creams,Plant-based creams for cooking,Soy-based creams for cooking
## 5 Plant-based foods and beverages,Plant-based foods,Seeds,Sunflower seeds and their products,Sunflower seeds,Roasted sunflower seeds,Unshelled sunflower seeds,Roasted unshelled sunflower seeds,es:Semillas-de-girasol-con-cascara-tostadas-aguasal
## 6
## origins origins_tags
## 1
## 2
## 3
## 4
## 5 Argentina argentina
## 6 South Korea south-korea
## manufacturing_places
## 1 France
## 2 Belgium
## 3
## 4 Italia
## 5 Beniparrell,Valencia (provincia),Comunidad Valenciana,España
## 6 South Korea
## manufacturing_places_tags
## 1 france
## 2 belgium
## 3
## 4 italia
## 5 beniparrell,valencia-provincia,comunidad-valenciana,espana
## 6 south-korea
## labels
## 1
## 2
## 3
## 4 Vegetariano,Vegano,Sin gluten,Sin OMG,Sin lactosa
## 5 Vegetariano,Vegano,Sin gluten
## 6
## labels_tags
## 1
## 2
## 3
## 4 en:vegetarian,en:vegan,en:gluten-free,en:no-gmos,en:no-lactose
## 5 en:vegetarian,en:vegan,en:gluten-free
## 6
## labels_en
## 1
## 2
## 3
## 4 Vegetarian,Vegan,Gluten-free,No GMOs,No lactose
## 5 Vegetarian,Vegan,Gluten-free
## 6
## emb_codes
## 1 EMB 78015
## 2
## 3
## 4
## 5 ES 21.016540/V EC,ENVASADOR:,IMPORTACO S.A.
## 6
## emb_codes_tags first_packaging_code_geo
## 1 emb-78015 48.983333,2.066667
## 2
## 3
## 4
## 5 es-21-016540-v-ec,envasador,importaco-s-a
## 6
## cities cities_tags purchase_places stores
## 1 NA andresy-yvelines-france Lyon,France Casino
## 2 NA NSW,Australia
## 3 NA France
## 4 NA Madrid,España El Corte Inglés
## 5 NA Madrid,España Mercadona
## 6 NA
## countries countries_tags countries_en
## 1 France en:france France
## 2 Australia en:australia Australia
## 3 France en:france France
## 4 España en:spain Spain
## 5 España en:spain Spain
## 6 Australia en:australia Australia
## ingredients_text
## 1 Sucre de canne, fraises 40 g, fraises des bois 14 g, gélifiant : pectines de fruits, jus de citron concentré. Préparée avec 54 g de fruits pour 100 g de produit fini.
## 2
## 3 Pulpe de pommes 50% , sucre, sirop de glucose, gélifiant : pectine, acidifiant : acide citrique, arômes, colorants naturels : extrait de paprika â complexes cuivreâchlorophyllines â curcumine â antnocyanes
## 4 Extracto de soja (78%) (agua, semillas de soja 8,3%), grasas vegetales, jarabe de glucosa, dextrosa, emulsionante: mono- y diglicéridos de ácidos grasos (E-471), sal marina, estabilizantes: goma xantana (E-415), carragenatos (E-407), goma guar (E-412); aromas, antioxidante: extractos de tocoferoles (de soja) (E-306). (Nota: el envase en italiano del paquete -que puede verse en el enlace-, especifica que el producto es 100% vegetal. Por tanto los mono- y diglicéridos de ácidos grasos (E-471) son de origen no animal).
## 5 Pipas de girasol y sal.
## 6
## allergens allergens_en traces traces_tags
## 1 NA Lait,Fruits à coque en:milk,en:nuts
## 2 NA
## 3 NA
## 4 NA
## 5 NA Frutos de cáscara,Cacahuetes en:nuts,en:peanuts
## 6 NA
## traces_en serving_size no_nutriments additives_n
## 1 Milk,Nuts 15 g NA 1
## 2 NA NA
## 3 NA 2
## 4 NA 5
## 5 Nuts,Peanuts NA 0
## 6 NA NA
## additives
## 1 [ sucre-de-canne -> fr:sucre-de-canne ] [ sucre-de -> fr:sucre-de ] [ sucre -> fr:sucre ] [ fraises-40-g -> fr:fraises-40-g ] [ fraises-40 -> fr:fraises-40 ] [ fraises -> fr:fraises ] [ fraises-des-bois-14-g -> fr:fraises-des-bois-14-g ] [ fraises-des-bois-14 -> fr:fraises-des-bois-14 ] [ fraises-des-bois -> fr:fraises-des-bois ] [ fraises-des -> fr:fraises-des ] [ fraises -> fr:fraises ] [ pectines-de-fruits -> fr:pectines-de-fruits ] [ pectines-de -> fr:pectines-de ] [ pectines -> en:e440 -> exists ] [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g-de-produit-fini -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g-de-produit-fini ] [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g-de-produit -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g-de-produit ] [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g-de -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g-de ] [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g ] [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100 -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100 ] [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour ] [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits ] [ jus-de-citron-concentre-preparee-avec-54-g-de -> fr:jus-de-citron-concentre-preparee-avec-54-g-de ] [ jus-de-citron-concentre-preparee-avec-54-g -> fr:jus-de-citron-concentre-preparee-avec-54-g ] [ jus-de-citron-concentre-preparee-avec-54 -> fr:jus-de-citron-concentre-preparee-avec-54 ] [ jus-de-citron-concentre-preparee-avec -> fr:jus-de-citron-concentre-preparee-avec ] [ jus-de-citron-concentre-preparee -> fr:jus-de-citron-concentre-preparee ] [ jus-de-citron-concentre -> fr:jus-de-citron-concentre ] [ jus-de-citron -> fr:jus-de-citron ] [ jus-de -> fr:jus-de ] [ jus -> fr:jus ]
## 2
## 3 [ pulpe-de-pommes-50 -> fr:pulpe-de-pommes-50 ] [ pulpe-de-pommes -> fr:pulpe-de-pommes ] [ pulpe-de -> fr:pulpe-de ] [ pulpe -> fr:pulpe ] [ sucre -> fr:sucre ] [ sirop-de-glucose -> fr:sirop-de-glucose ] [ sirop-de -> fr:sirop-de ] [ sirop -> fr:sirop ] [ pectine -> en:e440 -> exists ] [ acide-citrique -> en:e330 -> exists ] [ aromes -> fr:aromes ] [ naturels -> fr:naturels ] [ extrait-de-paprika-complexes-cuivre-chlorophyllines-curcumine-antnocyanes -> fr:extrait-de-paprika-complexes-cuivre-chlorophyllines-curcumine-antnocyanes ] [ extrait-de-paprika-complexes-cuivre-chlorophyllines-curcumine -> fr:extrait-de-paprika-complexes-cuivre-chlorophyllines-curcumine ] [ extrait-de-paprika-complexes-cuivre-chlorophyllines -> fr:extrait-de-paprika-complexes-cuivre-chlorophyllines ] [ extrait-de-paprika-complexes-cuivre -> fr:extrait-de-paprika-complexes-cuivre ] [ extrait-de-paprika-complexes -> fr:extrait-de-paprika-complexes ] [ extrait-de-paprika -> fr:extrait-de-paprika ] [ extrait-de -> fr:extrait-de ] [ extrait -> fr:extrait ]
## 4 [ extracto-de-soja -> es:extracto-de-soja ] [ 78 -> es:78 ] [ agua -> es:agua ] [ semillas-de-soja-8 -> es:semillas-de-soja-8 ] [ 3 -> en:fd-c ] [ grasas-vegetales -> es:grasas-vegetales ] [ jarabe-de-glucosa -> es:jarabe-de-glucosa ] [ dextrosa -> es:dextrosa ] [ emulsionante -> es:emulsionante ] [ mono-y-digliceridos-de-acidos-grasos -> en:e471 -> exists ] [ e471 -> en:e471 ] [ sal-marina -> es:sal-marina ] [ estabilizantes -> es:estabilizantes ] [ goma-xantana -> en:e415 -> exists ] [ e415 -> en:e415 ] [ carragenatos -> en:e407 -> exists ] [ e407 -> en:e407 ] [ goma-guar -> en:e412 -> exists ] [ e412 -> en:e412 ] [ aromas -> es:aromas ] [ antioxidante -> es:antioxidante ] [ extractos-de-tocoferoles -> es:extractos-de-tocoferoles ] [ de-soja -> es:de-soja ] [ e306 -> en:e306 -> exists ] [ nota -> es:nota ] [ el-envase-en-italiano-del-paquete-que-puede-verse-en-el-enlace -> es:el-envase-en-italiano-del-paquete-que-puede-verse-en-el-enlace ] [ especifica-que-el-producto-es-100-vegetal-por-tanto-los-mono-y-digliceridos-de-acidos-grasos -> es:especifica-que-el-producto-es-100-vegetal-por-tanto-los-mono-y-digliceridos-de-acidos-grasos ] [ e471 -> en:e471 ] [ son-de-origen-no-animal -> es:son-de-origen-no-animal ] [ -> es: ]
## 5 [ pipas-de-girasol-y-sal -> es:pipas-de-girasol-y-sal ]
## 6
## additives_tags
## 1 en:e440
## 2
## 3 en:e440,en:e330
## 4 en:e471,en:e415,en:e407,en:e412,en:e306
## 5
## 6
## additives_en
## 1 E440 - Pectins
## 2
## 3 E440 - Pectins,E330 - Citric acid
## 4 E471 - Mono- and diglycerides of fatty acids,E415 - Xanthan gum,E407 - Carrageenan,E412 - Guar gum,E306 - Tocopherol-rich extract
## 5
## 6
## ingredients_from_palm_oil_n ingredients_from_palm_oil
## 1 0 NA
## 2 NA NA
## 3 0 NA
## 4 0 NA
## 5 0 NA
## 6 NA NA
## ingredients_from_palm_oil_tags ingredients_that_may_be_from_palm_oil_n
## 1 0
## 2 NA
## 3 0
## 4 1
## 5 0
## 6 NA
## ingredients_that_may_be_from_palm_oil
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## ingredients_that_may_be_from_palm_oil_tags nutrition_grade_uk
## 1 NA
## 2 NA
## 3 NA
## 4 e471-mono-et-diglycerides-d-acides-gras-alimentaires NA
## 5 NA
## 6 NA
## nutrition_grade_fr pnns_groups_1 pnns_groups_2
## 1 d Sugary snacks Sweets
## 2 Sugary snacks Chocolate products
## 3 Fruits and vegetables Fruits
## 4 d unknown unknown
## 5 d unknown unknown
## 6 unknown unknown
## states
## 1 en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-to-be-completed, en:characteristics-completed, en:photos-validated, en:photos-uploaded
## 2 en:to-be-completed, en:nutrition-facts-to-be-completed, en:ingredients-to-be-completed, en:expiration-date-to-be-completed, en:characteristics-completed, en:photos-validated, en:photos-uploaded
## 3 en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-to-be-completed, en:characteristics-completed, en:photos-validated, en:photos-uploaded
## 4 en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-completed, en:characteristics-completed, en:photos-validated, en:photos-uploaded
## 5 en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-completed, en:characteristics-completed, en:photos-validated, en:photos-uploaded
## 6 en:to-be-completed, en:nutrition-facts-to-be-completed, en:ingredients-to-be-completed, en:expiration-date-to-be-completed, en:characteristics-to-be-completed, en:categories-to-be-completed, en:brands-to-be-completed, en:packaging-to-be-completed, en:quantity-to-be-completed, en:photos-to-be-validated, en:photos-uploaded
## states_tags
## 1 en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-to-be-completed,en:characteristics-completed,en:photos-validated,en:photos-uploaded
## 2 en:to-be-completed,en:nutrition-facts-to-be-completed,en:ingredients-to-be-completed,en:expiration-date-to-be-completed,en:characteristics-completed,en:photos-validated,en:photos-uploaded
## 3 en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-to-be-completed,en:characteristics-completed,en:photos-validated,en:photos-uploaded
## 4 en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-completed,en:characteristics-completed,en:photos-validated,en:photos-uploaded
## 5 en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-completed,en:characteristics-completed,en:photos-validated,en:photos-uploaded
## 6 en:to-be-completed,en:nutrition-facts-to-be-completed,en:ingredients-to-be-completed,en:expiration-date-to-be-completed,en:characteristics-to-be-completed,en:categories-to-be-completed,en:brands-to-be-completed,en:packaging-to-be-completed,en:quantity-to-be-completed,en:photos-to-be-validated,en:photos-uploaded
## states_en
## 1 To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date to be completed,Characteristics completed,Photos validated,Photos uploaded
## 2 To be completed,Nutrition facts to be completed,Ingredients to be completed,Expiration date to be completed,Characteristics completed,Photos validated,Photos uploaded
## 3 To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date to be completed,Characteristics completed,Photos validated,Photos uploaded
## 4 To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date completed,Characteristics completed,Photos validated,Photos uploaded
## 5 To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date completed,Characteristics completed,Photos validated,Photos uploaded
## 6 To be completed,Nutrition facts to be completed,Ingredients to be completed,Expiration date to be completed,Characteristics to be completed,Categories to be completed,Brands to be completed,Packaging to be completed,Quantity to be completed,Photos to be validated,Photos uploaded
## main_category main_category_en
## 1 en:plant-based-foods-and-beverages Plant-based foods and beverages
## 2 en:sugary-snacks Sugary snacks
## 3 en:plant-based-foods-and-beverages Plant-based foods and beverages
## 4 en:plant-based-foods-and-beverages Plant-based foods and beverages
## 5 en:plant-based-foods-and-beverages Plant-based foods and beverages
## 6
## image_url
## 1 http://en.openfoodfacts.org/images/products/322/247/574/5867/front.8.400.jpg
## 2 http://en.openfoodfacts.org/images/products/541/097/688/0110/front.7.400.jpg
## 3 http://en.openfoodfacts.org/images/products/326/475/042/3503/front.6.400.jpg
## 4 http://en.openfoodfacts.org/images/products/800/604/024/7001/front.7.400.jpg
## 5 http://en.openfoodfacts.org/images/products/848/000/034/0764/front.6.400.jpg
## 6 http://en.openfoodfacts.org/images/products/008/770/317/7727/front.8.400.jpg
## image_small_url
## 1 http://en.openfoodfacts.org/images/products/322/247/574/5867/front.8.200.jpg
## 2 http://en.openfoodfacts.org/images/products/541/097/688/0110/front.7.200.jpg
## 3 http://en.openfoodfacts.org/images/products/326/475/042/3503/front.6.200.jpg
## 4 http://en.openfoodfacts.org/images/products/800/604/024/7001/front.7.200.jpg
## 5 http://en.openfoodfacts.org/images/products/848/000/034/0764/front.6.200.jpg
## 6 http://en.openfoodfacts.org/images/products/008/770/317/7727/front.8.200.jpg
## energy_100g energy_from_fat_100g fat_100g saturated_fat_100g
## 1 918 NA 0.0 0.0
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 766 NA 16.7 9.9
## 5 2359 NA 45.5 5.2
## 6 NA NA NA NA
## butyric_acid_100g caproic_acid_100g caprylic_acid_100g capric_acid_100g
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## lauric_acid_100g myristic_acid_100g palmitic_acid_100g stearic_acid_100g
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## arachidic_acid_100g behenic_acid_100g lignoceric_acid_100g
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## cerotic_acid_100g montanic_acid_100g melissic_acid_100g
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## monounsaturated_fat_100g polyunsaturated_fat_100g omega_3_fat_100g
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 2.9 3.9 NA
## 5 9.5 32.8 NA
## 6 NA NA NA
## alpha_linolenic_acid_100g eicosapentaenoic_acid_100g
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## docosahexaenoic_acid_100g omega_6_fat_100g linoleic_acid_100g
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## arachidonic_acid_100g gamma_linolenic_acid_100g
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## dihomo_gamma_linolenic_acid_100g omega_9_fat_100g oleic_acid_100g
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## elaidic_acid_100g gondoic_acid_100g mead_acid_100g erucic_acid_100g
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## nervonic_acid_100g trans_fat_100g cholesterol_100g carbohydrates_100g
## 1 NA NA NA 54.0
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA 2e-04 5.7
## 5 NA NA NA 17.3
## 6 NA NA NA NA
## sugars_100g sucrose_100g glucose_100g fructose_100g lactose_100g
## 1 54.0 NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 4.2 NA NA NA NA
## 5 2.7 NA NA NA NA
## 6 NA NA NA NA NA
## maltose_100g maltodextrins_100g starch_100g polyols_100g fiber_100g
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA 0.2
## 5 NA NA NA NA 9.0
## 6 NA NA NA NA NA
## proteins_100g casein_100g serum_proteins_100g nucleotides_100g salt_100g
## 1 0.0 NA NA NA 0.0000
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 2.9 NA NA NA 0.0508
## 5 18.2 NA NA NA 3.9878
## 6 NA NA NA NA NA
## sodium_100g alcohol_100g vitamin_a_100g beta_carotene_100g
## 1 0.00 NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 0.02 NA NA NA
## 5 1.57 NA NA NA
## 6 NA NA NA NA
## vitamin_d_100g vitamin_e_100g vitamin_k_100g vitamin_c_100g
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## vitamin_b1_100g vitamin_b2_100g vitamin_pp_100g vitamin_b6_100g
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## vitamin_b9_100g vitamin_b12_100g biotin_100g pantothenic_acid_100g
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## silica_100g bicarbonate_100g potassium_100g chloride_100g calcium_100g
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## phosphorus_100g iron_100g magnesium_100g zinc_100g copper_100g
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 1.155 0.0038 0.129 NA NA
## 6 NA NA NA NA NA
## manganese_100g fluoride_100g selenium_100g chromium_100g molybdenum_100g
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## iodine_100g caffeine_100g taurine_100g ph_100g
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## fruits_vegetables_nuts_100g collagen_meat_protein_ratio_100g cocoa_100g
## 1 54 NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## chlorophyl_100g carbon_footprint_100g nutrition_score_fr_100g
## 1 NA NA 11
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA 11
## 5 NA NA 17
## 6 NA NA NA
## nutrition_score_uk_100g
## 1 11
## 2 NA
## 3 NA
## 4 11
## 5 17
## 6 NA
summary(food)
## V1 code url creator
## Min. : 1.0 Min. :100030 Length:1500 Length:1500
## 1st Qu.: 375.8 1st Qu.:124975 Class :character Class :character
## Median : 750.5 Median :149514 Mode :character Mode :character
## Mean : 750.5 Mean :149613
## 3rd Qu.:1125.2 3rd Qu.:174506
## Max. :1500.0 Max. :199880
##
## created_t created_datetime last_modified_t
## Min. :1.332e+09 Length:1500 Min. :1.340e+09
## 1st Qu.:1.394e+09 Class :character 1st Qu.:1.424e+09
## Median :1.425e+09 Mode :character Median :1.437e+09
## Mean :1.414e+09 Mean :1.430e+09
## 3rd Qu.:1.436e+09 3rd Qu.:1.446e+09
## Max. :1.453e+09 Max. :1.453e+09
##
## last_modified_datetime product_name generic_name
## Length:1500 Length:1500 Length:1500
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## quantity packaging packaging_tags
## Length:1500 Length:1500 Length:1500
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## brands brands_tags categories
## Length:1500 Length:1500 Length:1500
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## categories_tags categories_en origins
## Length:1500 Length:1500 Length:1500
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## origins_tags manufacturing_places manufacturing_places_tags
## Length:1500 Length:1500 Length:1500
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## labels labels_tags labels_en
## Length:1500 Length:1500 Length:1500
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## emb_codes emb_codes_tags first_packaging_code_geo
## Length:1500 Length:1500 Length:1500
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## cities cities_tags purchase_places stores
## Mode:logical Length:1500 Length:1500 Length:1500
## NA's:1500 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## countries countries_tags countries_en
## Length:1500 Length:1500 Length:1500
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## ingredients_text allergens allergens_en traces
## Length:1500 Length:1500 Mode:logical Length:1500
## Class :character Class :character NA's:1500 Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## traces_tags traces_en serving_size no_nutriments
## Length:1500 Length:1500 Length:1500 Mode:logical
## Class :character Class :character Class :character NA's:1500
## Mode :character Mode :character Mode :character
##
##
##
##
## additives_n additives additives_tags additives_en
## Min. : 0.000 Length:1500 Length:1500 Length:1500
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 1.000 Mode :character Mode :character Mode :character
## Mean : 1.846
## 3rd Qu.: 3.000
## Max. :17.000
## NA's :514
## ingredients_from_palm_oil_n ingredients_from_palm_oil
## Min. :0.0000 Mode:logical
## 1st Qu.:0.0000 NA's:1500
## Median :0.0000
## Mean :0.0487
## 3rd Qu.:0.0000
## Max. :1.0000
## NA's :514
## ingredients_from_palm_oil_tags ingredients_that_may_be_from_palm_oil_n
## Length:1500 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.1379
## 3rd Qu.:0.0000
## Max. :4.0000
## NA's :514
## ingredients_that_may_be_from_palm_oil
## Mode:logical
## NA's:1500
##
##
##
##
##
## ingredients_that_may_be_from_palm_oil_tags nutrition_grade_uk
## Length:1500 Mode:logical
## Class :character NA's:1500
## Mode :character
##
##
##
##
## nutrition_grade_fr pnns_groups_1 pnns_groups_2
## Length:1500 Length:1500 Length:1500
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## states states_tags states_en
## Length:1500 Length:1500 Length:1500
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## main_category main_category_en image_url
## Length:1500 Length:1500 Length:1500
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## image_small_url energy_100g energy_from_fat_100g fat_100g
## Length:1500 Min. : 0.0 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.: 369.8 1st Qu.: 35.98 1st Qu.: 0.90
## Mode :character Median : 966.5 Median : 237.00 Median : 6.00
## Mean :1083.2 Mean : 668.41 Mean : 13.39
## 3rd Qu.:1641.5 3rd Qu.: 974.00 3rd Qu.: 20.00
## Max. :3700.0 Max. :2900.00 Max. :100.00
## NA's :700 NA's :1486 NA's :708
## saturated_fat_100g butyric_acid_100g caproic_acid_100g caprylic_acid_100g
## Min. : 0.000 Mode:logical Mode:logical Mode:logical
## 1st Qu.: 0.200 NA's:1500 NA's:1500 NA's:1500
## Median : 1.700
## Mean : 4.874
## 3rd Qu.: 6.500
## Max. :57.000
## NA's :797
## capric_acid_100g lauric_acid_100g myristic_acid_100g palmitic_acid_100g
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:1500 NA's:1500 NA's:1500 NA's:1500
##
##
##
##
##
## stearic_acid_100g arachidic_acid_100g behenic_acid_100g
## Mode:logical Mode:logical Mode:logical
## NA's:1500 NA's:1500 NA's:1500
##
##
##
##
##
## lignoceric_acid_100g cerotic_acid_100g montanic_acid_100g
## Mode:logical Mode:logical Mode:logical
## NA's:1500 NA's:1500 NA's:1500
##
##
##
##
##
## melissic_acid_100g monounsaturated_fat_100g polyunsaturated_fat_100g
## Mode:logical Min. : 0.00 Min. : 0.400
## NA's:1500 1st Qu.: 3.87 1st Qu.: 1.653
## Median : 9.50 Median : 3.900
## Mean :19.77 Mean : 9.986
## 3rd Qu.:29.00 3rd Qu.:12.700
## Max. :75.00 Max. :46.200
## NA's :1465 NA's :1464
## omega_3_fat_100g alpha_linolenic_acid_100g eicosapentaenoic_acid_100g
## Min. : 0.033 Min. :0.0800 Min. :0.721
## 1st Qu.: 1.300 1st Qu.:0.0905 1st Qu.:0.721
## Median : 3.000 Median :0.1010 Median :0.721
## Mean : 3.726 Mean :0.1737 Mean :0.721
## 3rd Qu.: 3.200 3rd Qu.:0.2205 3rd Qu.:0.721
## Max. :12.400 Max. :0.3400 Max. :0.721
## NA's :1491 NA's :1497 NA's :1499
## docosahexaenoic_acid_100g omega_6_fat_100g linoleic_acid_100g
## Min. :1.09 Min. :0.25 Min. :0.5000
## 1st Qu.:1.09 1st Qu.:0.25 1st Qu.:0.5165
## Median :1.09 Median :0.25 Median :0.5330
## Mean :1.09 Mean :0.25 Mean :0.5330
## 3rd Qu.:1.09 3rd Qu.:0.25 3rd Qu.:0.5495
## Max. :1.09 Max. :0.25 Max. :0.5660
## NA's :1499 NA's :1499 NA's :1498
## arachidonic_acid_100g gamma_linolenic_acid_100g
## Mode:logical Mode:logical
## NA's:1500 NA's:1500
##
##
##
##
##
## dihomo_gamma_linolenic_acid_100g omega_9_fat_100g oleic_acid_100g
## Mode:logical Mode:logical Mode:logical
## NA's:1500 NA's:1500 NA's:1500
##
##
##
##
##
## elaidic_acid_100g gondoic_acid_100g mead_acid_100g erucic_acid_100g
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:1500 NA's:1500 NA's:1500 NA's:1500
##
##
##
##
##
## nervonic_acid_100g trans_fat_100g cholesterol_100g carbohydrates_100g
## Mode:logical Min. :0.0000 Min. :0.0000 Min. : 0.000
## NA's:1500 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 3.792
## Median :0.0000 Median :0.0000 Median : 13.500
## Mean :0.0105 Mean :0.0265 Mean : 27.958
## 3rd Qu.:0.0000 3rd Qu.:0.0026 3rd Qu.: 55.000
## Max. :0.1000 Max. :0.4300 Max. :100.000
## NA's :1481 NA's :1477 NA's :708
## sugars_100g sucrose_100g glucose_100g fructose_100g
## Min. : 0.00 Mode:logical Mode:logical Min. :100
## 1st Qu.: 1.00 NA's:1500 NA's:1500 1st Qu.:100
## Median : 4.05 Median :100
## Mean : 12.66 Mean :100
## 3rd Qu.: 14.70 3rd Qu.:100
## Max. :100.00 Max. :100
## NA's :788 NA's :1499
## lactose_100g maltose_100g maltodextrins_100g starch_100g
## Min. :0.000 Mode:logical Mode:logical Min. : 0.00
## 1st Qu.:0.250 NA's:1500 NA's:1500 1st Qu.: 9.45
## Median :0.500 Median :39.50
## Mean :2.933 Mean :30.73
## 3rd Qu.:4.400 3rd Qu.:42.85
## Max. :8.300 Max. :71.00
## NA's :1497 NA's :1493
## polyols_100g fiber_100g proteins_100g casein_100g
## Min. : 8.60 Min. : 0.000 Min. : 0.000 Min. :1.1
## 1st Qu.:59.10 1st Qu.: 0.500 1st Qu.: 1.500 1st Qu.:1.1
## Median :67.00 Median : 1.750 Median : 6.000 Median :1.1
## Mean :56.06 Mean : 2.823 Mean : 7.563 Mean :1.1
## 3rd Qu.:69.80 3rd Qu.: 3.500 3rd Qu.:10.675 3rd Qu.:1.1
## Max. :70.00 Max. :46.700 Max. :61.000 Max. :1.1
## NA's :1491 NA's :994 NA's :710 NA's :1499
## serum_proteins_100g nucleotides_100g salt_100g sodium_100g
## Mode:logical Mode:logical Min. : 0.0000 Min. : 0.0000
## NA's:1500 NA's:1500 1st Qu.: 0.0438 1st Qu.: 0.0172
## Median : 0.4498 Median : 0.1771
## Mean : 1.1205 Mean : 0.4409
## 3rd Qu.: 1.1938 3rd Qu.: 0.4700
## Max. :102.0000 Max. :40.0000
## NA's :780 NA's :780
## alcohol_100g vitamin_a_100g beta_carotene_100g vitamin_d_100g
## Min. : 0.00 Min. :0.0000 Mode:logical Min. :0e+00
## 1st Qu.: 0.00 1st Qu.:0.0000 NA's:1500 1st Qu.:0e+00
## Median : 5.50 Median :0.0001 Median :0e+00
## Mean :10.07 Mean :0.0003 Mean :0e+00
## 3rd Qu.:13.00 3rd Qu.:0.0006 3rd Qu.:0e+00
## Max. :50.00 Max. :0.0013 Max. :1e-04
## NA's :1433 NA's :1477 NA's :1485
## vitamin_e_100g vitamin_k_100g vitamin_c_100g vitamin_b1_100g
## Min. :0.0005 Min. :0 Min. :0.000 Min. :0.0001
## 1st Qu.:0.0021 1st Qu.:0 1st Qu.:0.002 1st Qu.:0.0003
## Median :0.0044 Median :0 Median :0.019 Median :0.0004
## Mean :0.0069 Mean :0 Mean :0.025 Mean :0.0006
## 3rd Qu.:0.0097 3rd Qu.:0 3rd Qu.:0.030 3rd Qu.:0.0010
## Max. :0.0320 Max. :0 Max. :0.217 Max. :0.0013
## NA's :1478 NA's :1498 NA's :1459 NA's :1478
## vitamin_b2_100g vitamin_pp_100g vitamin_b6_100g vitamin_b9_100g
## Min. :0.0002 Min. :0.0006 Min. :0.0001 Min. :0e+00
## 1st Qu.:0.0003 1st Qu.:0.0033 1st Qu.:0.0002 1st Qu.:0e+00
## Median :0.0009 Median :0.0069 Median :0.0008 Median :1e-04
## Mean :0.0011 Mean :0.0086 Mean :0.0112 Mean :1e-04
## 3rd Qu.:0.0013 3rd Qu.:0.0140 3rd Qu.:0.0012 3rd Qu.:2e-04
## Max. :0.0066 Max. :0.0160 Max. :0.2000 Max. :2e-04
## NA's :1483 NA's :1484 NA's :1481 NA's :1483
## vitamin_b12_100g biotin_100g pantothenic_acid_100g silica_100g
## Min. :0 Min. :0 Min. :0.0000 Min. :8e-04
## 1st Qu.:0 1st Qu.:0 1st Qu.:0.0007 1st Qu.:8e-04
## Median :0 Median :0 Median :0.0020 Median :8e-04
## Mean :0 Mean :0 Mean :0.0027 Mean :8e-04
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0.0051 3rd Qu.:8e-04
## Max. :0 Max. :0 Max. :0.0060 Max. :8e-04
## NA's :1489 NA's :1498 NA's :1486 NA's :1499
## bicarbonate_100g potassium_100g chloride_100g calcium_100g
## Min. :0.0006 Min. :0.0000 Min. :0.0003 Min. :0.0000
## 1st Qu.:0.0678 1st Qu.:0.0650 1st Qu.:0.0006 1st Qu.:0.0450
## Median :0.1350 Median :0.1940 Median :0.0009 Median :0.1200
## Mean :0.1692 Mean :0.3288 Mean :0.0144 Mean :0.2040
## 3rd Qu.:0.2535 3rd Qu.:0.3670 3rd Qu.:0.0214 3rd Qu.:0.1985
## Max. :0.3720 Max. :1.4300 Max. :0.0420 Max. :1.0000
## NA's :1497 NA's :1487 NA's :1497 NA's :1449
## phosphorus_100g iron_100g magnesium_100g zinc_100g
## Min. :0.0430 Min. :0.0000 Min. :0.0000 Min. :0.0005
## 1st Qu.:0.1938 1st Qu.:0.0012 1st Qu.:0.0670 1st Qu.:0.0009
## Median :0.3185 Median :0.0042 Median :0.1040 Median :0.0017
## Mean :0.3777 Mean :0.0045 Mean :0.1066 Mean :0.0016
## 3rd Qu.:0.4340 3rd Qu.:0.0077 3rd Qu.:0.1300 3rd Qu.:0.0022
## Max. :1.1550 Max. :0.0137 Max. :0.3330 Max. :0.0026
## NA's :1488 NA's :1463 NA's :1479 NA's :1493
## copper_100g manganese_100g fluoride_100g selenium_100g
## Min. :0e+00 Min. :0 Min. :0 Min. :0
## 1st Qu.:1e-04 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :1e-04 Median :0 Median :0 Median :0
## Mean :1e-04 Mean :0 Mean :0 Mean :0
## 3rd Qu.:1e-04 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :1e-04 Max. :0 Max. :0 Max. :0
## NA's :1498 NA's :1499 NA's :1498 NA's :1499
## chromium_100g molybdenum_100g iodine_100g caffeine_100g
## Mode:logical Mode:logical Min. :0 Mode:logical
## NA's:1500 NA's:1500 1st Qu.:0 NA's:1500
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
## NA's :1499
## taurine_100g ph_100g fruits_vegetables_nuts_100g
## Mode:logical Mode:logical Min. : 2.00
## NA's:1500 NA's:1500 1st Qu.:11.25
## Median :42.00
## Mean :36.88
## 3rd Qu.:52.25
## Max. :80.00
## NA's :1470
## collagen_meat_protein_ratio_100g cocoa_100g chlorophyl_100g
## Min. :12.00 Min. :30 Mode:logical
## 1st Qu.:13.50 1st Qu.:47 NA's:1500
## Median :15.00 Median :60
## Mean :15.67 Mean :57
## 3rd Qu.:17.50 3rd Qu.:70
## Max. :20.00 Max. :81
## NA's :1497 NA's :1491
## carbon_footprint_100g nutrition_score_fr_100g nutrition_score_uk_100g
## Min. : 12.00 Min. :-12.000 Min. :-12.000
## 1st Qu.: 97.42 1st Qu.: 1.000 1st Qu.: 0.000
## Median :182.85 Median : 7.000 Median : 6.000
## Mean :131.18 Mean : 7.941 Mean : 7.631
## 3rd Qu.:190.78 3rd Qu.: 15.000 3rd Qu.: 16.000
## Max. :198.70 Max. : 28.000 Max. : 28.000
## NA's :1497 NA's :825 NA's :825
# Conclusion: hard to conclude anything with so many columns (160).
# Use glimpse() or names() for a more concise look.
library(dplyr)
glimpse(food)
## Observations: 1,500
## Variables: 160
## $ V1 <int> 1, 2, 3, 4, 5, 6, 7...
## $ code <int> 100030, 100050, 100...
## $ url <chr> "http://world-en.op...
## $ creator <chr> "sebleouf", "foodor...
## $ created_t <int> 1424747544, 1450316...
## $ created_datetime <chr> "2015-02-24T03:12:2...
## $ last_modified_t <int> 1438445887, 1450817...
## $ last_modified_datetime <chr> "2015-08-01T16:18:0...
## $ product_name <chr> "Confiture de frais...
## $ generic_name <chr> "", "", "Pâtes de ...
## $ quantity <chr> "265 g", "375g", "1...
## $ packaging <chr> "Bocal,Verre", "Pla...
## $ packaging_tags <chr> "bocal,verre", "pla...
## $ brands <chr> "Casino Délices", ...
## $ brands_tags <chr> "casino-delices", "...
## $ categories <chr> "Aliments et boisso...
## $ categories_tags <chr> "en:plant-based-foo...
## $ categories_en <chr> "Plant-based foods ...
## $ origins <chr> "", "", "", "", "Ar...
## $ origins_tags <chr> "", "", "", "", "ar...
## $ manufacturing_places <chr> "France", "Belgium"...
## $ manufacturing_places_tags <chr> "france", "belgium"...
## $ labels <chr> "", "", "", "Vegeta...
## $ labels_tags <chr> "", "", "", "en:veg...
## $ labels_en <chr> "", "", "", "Vegeta...
## $ emb_codes <chr> "EMB 78015", "", ""...
## $ emb_codes_tags <chr> "emb-78015", "", ""...
## $ first_packaging_code_geo <chr> "48.983333,2.066667...
## $ cities <lgl> NA, NA, NA, NA, NA,...
## $ cities_tags <chr> "andresy-yvelines-f...
## $ purchase_places <chr> "Lyon,France", "NSW...
## $ stores <chr> "Casino", "", "", "...
## $ countries <chr> "France", "Australi...
## $ countries_tags <chr> "en:france", "en:au...
## $ countries_en <chr> "France", "Australi...
## $ ingredients_text <chr> "Sucre de canne, fr...
## $ allergens <chr> "", "", "", "", "",...
## $ allergens_en <lgl> NA, NA, NA, NA, NA,...
## $ traces <chr> "Lait,Fruits à coq...
## $ traces_tags <chr> "en:milk,en:nuts", ...
## $ traces_en <chr> "Milk,Nuts", "", ""...
## $ serving_size <chr> "15 g", "", "", "",...
## $ no_nutriments <lgl> NA, NA, NA, NA, NA,...
## $ additives_n <int> 1, NA, 2, 5, 0, NA,...
## $ additives <chr> "[ sucre-de-canne -...
## $ additives_tags <chr> "en:e440", "", "en:...
## $ additives_en <chr> "E440 - Pectins", "...
## $ ingredients_from_palm_oil_n <int> 0, NA, 0, 0, 0, NA,...
## $ ingredients_from_palm_oil <lgl> NA, NA, NA, NA, NA,...
## $ ingredients_from_palm_oil_tags <chr> "", "", "", "", "",...
## $ ingredients_that_may_be_from_palm_oil_n <int> 0, NA, 0, 1, 0, NA,...
## $ ingredients_that_may_be_from_palm_oil <lgl> NA, NA, NA, NA, NA,...
## $ ingredients_that_may_be_from_palm_oil_tags <chr> "", "", "", "e471-m...
## $ nutrition_grade_uk <lgl> NA, NA, NA, NA, NA,...
## $ nutrition_grade_fr <chr> "d", "", "", "d", "...
## $ pnns_groups_1 <chr> "Sugary snacks", "S...
## $ pnns_groups_2 <chr> "Sweets", "Chocolat...
## $ states <chr> "en:to-be-checked, ...
## $ states_tags <chr> "en:to-be-checked,e...
## $ states_en <chr> "To be checked,Comp...
## $ main_category <chr> "en:plant-based-foo...
## $ main_category_en <chr> "Plant-based foods ...
## $ image_url <chr> "http://en.openfood...
## $ image_small_url <chr> "http://en.openfood...
## $ energy_100g <dbl> 918, NA, NA, 766, 2...
## $ energy_from_fat_100g <dbl> NA, NA, NA, NA, NA,...
## $ fat_100g <dbl> 0.00, NA, NA, 16.70...
## $ saturated_fat_100g <dbl> 0.000, NA, NA, 9.90...
## $ butyric_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ caproic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ caprylic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ capric_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ lauric_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ myristic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ palmitic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ stearic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ arachidic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ behenic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ lignoceric_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ cerotic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ montanic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ melissic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ monounsaturated_fat_100g <dbl> NA, NA, NA, 2.9, 9....
## $ polyunsaturated_fat_100g <dbl> NA, NA, NA, 3.9, 32...
## $ omega_3_fat_100g <dbl> NA, NA, NA, NA, NA,...
## $ alpha_linolenic_acid_100g <dbl> NA, NA, NA, NA, NA,...
## $ eicosapentaenoic_acid_100g <dbl> NA, NA, NA, NA, NA,...
## $ docosahexaenoic_acid_100g <dbl> NA, NA, NA, NA, NA,...
## $ omega_6_fat_100g <dbl> NA, NA, NA, NA, NA,...
## $ linoleic_acid_100g <dbl> NA, NA, NA, NA, NA,...
## $ arachidonic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ gamma_linolenic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ dihomo_gamma_linolenic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ omega_9_fat_100g <lgl> NA, NA, NA, NA, NA,...
## $ oleic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ elaidic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ gondoic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ mead_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ erucic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ nervonic_acid_100g <lgl> NA, NA, NA, NA, NA,...
## $ trans_fat_100g <dbl> NA, NA, NA, NA, NA,...
## $ cholesterol_100g <dbl> NA, NA, NA, 0.00020...
## $ carbohydrates_100g <dbl> 54.00, NA, NA, 5.70...
## $ sugars_100g <dbl> 54.00, NA, NA, 4.20...
## $ sucrose_100g <lgl> NA, NA, NA, NA, NA,...
## $ glucose_100g <lgl> NA, NA, NA, NA, NA,...
## $ fructose_100g <int> NA, NA, NA, NA, NA,...
## $ lactose_100g <dbl> NA, NA, NA, NA, NA,...
## $ maltose_100g <lgl> NA, NA, NA, NA, NA,...
## $ maltodextrins_100g <lgl> NA, NA, NA, NA, NA,...
## $ starch_100g <dbl> NA, NA, NA, NA, NA,...
## $ polyols_100g <dbl> NA, NA, NA, NA, NA,...
## $ fiber_100g <dbl> NA, NA, NA, 0.2, 9....
## $ proteins_100g <dbl> 0.00, NA, NA, 2.90,...
## $ casein_100g <dbl> NA, NA, NA, NA, NA,...
## $ serum_proteins_100g <lgl> NA, NA, NA, NA, NA,...
## $ nucleotides_100g <lgl> NA, NA, NA, NA, NA,...
## $ salt_100g <dbl> 0.0000000, NA, NA, ...
## $ sodium_100g <dbl> 0.0000000, NA, NA, ...
## $ alcohol_100g <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_a_100g <dbl> NA, NA, NA, NA, NA,...
## $ beta_carotene_100g <lgl> NA, NA, NA, NA, NA,...
## $ vitamin_d_100g <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_e_100g <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_k_100g <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_c_100g <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_b1_100g <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_b2_100g <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_pp_100g <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_b6_100g <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_b9_100g <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_b12_100g <dbl> NA, NA, NA, NA, NA,...
## $ biotin_100g <dbl> NA, NA, NA, NA, NA,...
## $ pantothenic_acid_100g <dbl> NA, NA, NA, NA, NA,...
## $ silica_100g <dbl> NA, NA, NA, NA, NA,...
## $ bicarbonate_100g <dbl> NA, NA, NA, NA, NA,...
## $ potassium_100g <dbl> NA, NA, NA, NA, NA,...
## $ chloride_100g <dbl> NA, NA, NA, NA, NA,...
## $ calcium_100g <dbl> NA, NA, NA, NA, NA,...
## $ phosphorus_100g <dbl> NA, NA, NA, NA, 1.1...
## $ iron_100g <dbl> NA, NA, NA, NA, 0.0...
## $ magnesium_100g <dbl> NA, NA, NA, NA, 0.1...
## $ zinc_100g <dbl> NA, NA, NA, NA, NA,...
## $ copper_100g <dbl> NA, NA, NA, NA, NA,...
## $ manganese_100g <dbl> NA, NA, NA, NA, NA,...
## $ fluoride_100g <dbl> NA, NA, NA, NA, NA,...
## $ selenium_100g <dbl> NA, NA, NA, NA, NA,...
## $ chromium_100g <lgl> NA, NA, NA, NA, NA,...
## $ molybdenum_100g <lgl> NA, NA, NA, NA, NA,...
## $ iodine_100g <dbl> NA, NA, NA, NA, NA,...
## $ caffeine_100g <lgl> NA, NA, NA, NA, NA,...
## $ taurine_100g <lgl> NA, NA, NA, NA, NA,...
## $ ph_100g <lgl> NA, NA, NA, NA, NA,...
## $ fruits_vegetables_nuts_100g <dbl> 54, NA, NA, NA, NA,...
## $ collagen_meat_protein_ratio_100g <int> NA, NA, NA, NA, NA,...
## $ cocoa_100g <int> NA, NA, NA, NA, NA,...
## $ chlorophyl_100g <lgl> NA, NA, NA, NA, NA,...
## $ carbon_footprint_100g <dbl> NA, NA, NA, NA, NA,...
## $ nutrition_score_fr_100g <int> 11, NA, NA, 11, 17,...
## $ nutrition_score_uk_100g <int> 11, NA, NA, 11, 17,...
names(food)
## [1] "V1"
## [2] "code"
## [3] "url"
## [4] "creator"
## [5] "created_t"
## [6] "created_datetime"
## [7] "last_modified_t"
## [8] "last_modified_datetime"
## [9] "product_name"
## [10] "generic_name"
## [11] "quantity"
## [12] "packaging"
## [13] "packaging_tags"
## [14] "brands"
## [15] "brands_tags"
## [16] "categories"
## [17] "categories_tags"
## [18] "categories_en"
## [19] "origins"
## [20] "origins_tags"
## [21] "manufacturing_places"
## [22] "manufacturing_places_tags"
## [23] "labels"
## [24] "labels_tags"
## [25] "labels_en"
## [26] "emb_codes"
## [27] "emb_codes_tags"
## [28] "first_packaging_code_geo"
## [29] "cities"
## [30] "cities_tags"
## [31] "purchase_places"
## [32] "stores"
## [33] "countries"
## [34] "countries_tags"
## [35] "countries_en"
## [36] "ingredients_text"
## [37] "allergens"
## [38] "allergens_en"
## [39] "traces"
## [40] "traces_tags"
## [41] "traces_en"
## [42] "serving_size"
## [43] "no_nutriments"
## [44] "additives_n"
## [45] "additives"
## [46] "additives_tags"
## [47] "additives_en"
## [48] "ingredients_from_palm_oil_n"
## [49] "ingredients_from_palm_oil"
## [50] "ingredients_from_palm_oil_tags"
## [51] "ingredients_that_may_be_from_palm_oil_n"
## [52] "ingredients_that_may_be_from_palm_oil"
## [53] "ingredients_that_may_be_from_palm_oil_tags"
## [54] "nutrition_grade_uk"
## [55] "nutrition_grade_fr"
## [56] "pnns_groups_1"
## [57] "pnns_groups_2"
## [58] "states"
## [59] "states_tags"
## [60] "states_en"
## [61] "main_category"
## [62] "main_category_en"
## [63] "image_url"
## [64] "image_small_url"
## [65] "energy_100g"
## [66] "energy_from_fat_100g"
## [67] "fat_100g"
## [68] "saturated_fat_100g"
## [69] "butyric_acid_100g"
## [70] "caproic_acid_100g"
## [71] "caprylic_acid_100g"
## [72] "capric_acid_100g"
## [73] "lauric_acid_100g"
## [74] "myristic_acid_100g"
## [75] "palmitic_acid_100g"
## [76] "stearic_acid_100g"
## [77] "arachidic_acid_100g"
## [78] "behenic_acid_100g"
## [79] "lignoceric_acid_100g"
## [80] "cerotic_acid_100g"
## [81] "montanic_acid_100g"
## [82] "melissic_acid_100g"
## [83] "monounsaturated_fat_100g"
## [84] "polyunsaturated_fat_100g"
## [85] "omega_3_fat_100g"
## [86] "alpha_linolenic_acid_100g"
## [87] "eicosapentaenoic_acid_100g"
## [88] "docosahexaenoic_acid_100g"
## [89] "omega_6_fat_100g"
## [90] "linoleic_acid_100g"
## [91] "arachidonic_acid_100g"
## [92] "gamma_linolenic_acid_100g"
## [93] "dihomo_gamma_linolenic_acid_100g"
## [94] "omega_9_fat_100g"
## [95] "oleic_acid_100g"
## [96] "elaidic_acid_100g"
## [97] "gondoic_acid_100g"
## [98] "mead_acid_100g"
## [99] "erucic_acid_100g"
## [100] "nervonic_acid_100g"
## [101] "trans_fat_100g"
## [102] "cholesterol_100g"
## [103] "carbohydrates_100g"
## [104] "sugars_100g"
## [105] "sucrose_100g"
## [106] "glucose_100g"
## [107] "fructose_100g"
## [108] "lactose_100g"
## [109] "maltose_100g"
## [110] "maltodextrins_100g"
## [111] "starch_100g"
## [112] "polyols_100g"
## [113] "fiber_100g"
## [114] "proteins_100g"
## [115] "casein_100g"
## [116] "serum_proteins_100g"
## [117] "nucleotides_100g"
## [118] "salt_100g"
## [119] "sodium_100g"
## [120] "alcohol_100g"
## [121] "vitamin_a_100g"
## [122] "beta_carotene_100g"
## [123] "vitamin_d_100g"
## [124] "vitamin_e_100g"
## [125] "vitamin_k_100g"
## [126] "vitamin_c_100g"
## [127] "vitamin_b1_100g"
## [128] "vitamin_b2_100g"
## [129] "vitamin_pp_100g"
## [130] "vitamin_b6_100g"
## [131] "vitamin_b9_100g"
## [132] "vitamin_b12_100g"
## [133] "biotin_100g"
## [134] "pantothenic_acid_100g"
## [135] "silica_100g"
## [136] "bicarbonate_100g"
## [137] "potassium_100g"
## [138] "chloride_100g"
## [139] "calcium_100g"
## [140] "phosphorus_100g"
## [141] "iron_100g"
## [142] "magnesium_100g"
## [143] "zinc_100g"
## [144] "copper_100g"
## [145] "manganese_100g"
## [146] "fluoride_100g"
## [147] "selenium_100g"
## [148] "chromium_100g"
## [149] "molybdenum_100g"
## [150] "iodine_100g"
## [151] "caffeine_100g"
## [152] "taurine_100g"
## [153] "ph_100g"
## [154] "fruits_vegetables_nuts_100g"
## [155] "collagen_meat_protein_ratio_100g"
## [156] "cocoa_100g"
## [157] "chlorophyl_100g"
## [158] "carbon_footprint_100g"
## [159] "nutrition_score_fr_100g"
## [160] "nutrition_score_uk_100g"
# Conclusions:
# What and when information was added (1:9)
# Meta information about food (10:17, 22:27)
# Where it came from (18:21, 28:34)
# What it's made of (35:52)
# Nutrition grades (53:54)
# Unclear (55:63)
# Nutritional information (64:159).
# Some columns are duplicates:
duplicates <- c(4, 6, 11, 13, 15, 17, 18, 20, 22,
24, 25, 28, 32, 34, 36, 38, 40,
44, 46, 48, 51, 54, 65, 158)
food2 <- food[,-duplicates]
# Some columns are useless:
useless <- c(1, 2, 3, 32:41)
food3 <- food2[,-useless]
# We care only about the nutrition cols, the ones with 100g in their name.
library(stringr)
nutrition <- str_detect(colnames(food3), "100g")
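# (An equivalent name-based selection, a sketch assuming dplyr is loaded:
#  food3 %>% select(contains("100g")))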
summary(food3[,nutrition])
## energy_from_fat_100g fat_100g saturated_fat_100g
## Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 35.98 1st Qu.: 0.90 1st Qu.: 0.200
## Median : 237.00 Median : 6.00 Median : 1.700
## Mean : 668.41 Mean : 13.39 Mean : 4.874
## 3rd Qu.: 974.00 3rd Qu.: 20.00 3rd Qu.: 6.500
## Max. :2900.00 Max. :100.00 Max. :57.000
## NA's :1486 NA's :708 NA's :797
## butyric_acid_100g caproic_acid_100g caprylic_acid_100g capric_acid_100g
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:1500 NA's:1500 NA's:1500 NA's:1500
##
##
##
##
##
## lauric_acid_100g myristic_acid_100g palmitic_acid_100g stearic_acid_100g
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:1500 NA's:1500 NA's:1500 NA's:1500
##
##
##
##
##
## arachidic_acid_100g behenic_acid_100g lignoceric_acid_100g
## Mode:logical Mode:logical Mode:logical
## NA's:1500 NA's:1500 NA's:1500
##
##
##
##
##
## cerotic_acid_100g montanic_acid_100g melissic_acid_100g
## Mode:logical Mode:logical Mode:logical
## NA's:1500 NA's:1500 NA's:1500
##
##
##
##
##
## monounsaturated_fat_100g polyunsaturated_fat_100g omega_3_fat_100g
## Min. : 0.00 Min. : 0.400 Min. : 0.033
## 1st Qu.: 3.87 1st Qu.: 1.653 1st Qu.: 1.300
## Median : 9.50 Median : 3.900 Median : 3.000
## Mean :19.77 Mean : 9.986 Mean : 3.726
## 3rd Qu.:29.00 3rd Qu.:12.700 3rd Qu.: 3.200
## Max. :75.00 Max. :46.200 Max. :12.400
## NA's :1465 NA's :1464 NA's :1491
## alpha_linolenic_acid_100g eicosapentaenoic_acid_100g
## Min. :0.0800 Min. :0.721
## 1st Qu.:0.0905 1st Qu.:0.721
## Median :0.1010 Median :0.721
## Mean :0.1737 Mean :0.721
## 3rd Qu.:0.2205 3rd Qu.:0.721
## Max. :0.3400 Max. :0.721
## NA's :1497 NA's :1499
## docosahexaenoic_acid_100g omega_6_fat_100g linoleic_acid_100g
## Min. :1.09 Min. :0.25 Min. :0.5000
## 1st Qu.:1.09 1st Qu.:0.25 1st Qu.:0.5165
## Median :1.09 Median :0.25 Median :0.5330
## Mean :1.09 Mean :0.25 Mean :0.5330
## 3rd Qu.:1.09 3rd Qu.:0.25 3rd Qu.:0.5495
## Max. :1.09 Max. :0.25 Max. :0.5660
## NA's :1499 NA's :1499 NA's :1498
## arachidonic_acid_100g gamma_linolenic_acid_100g
## Mode:logical Mode:logical
## NA's:1500 NA's:1500
##
##
##
##
##
## dihomo_gamma_linolenic_acid_100g omega_9_fat_100g oleic_acid_100g
## Mode:logical Mode:logical Mode:logical
## NA's:1500 NA's:1500 NA's:1500
##
##
##
##
##
## elaidic_acid_100g gondoic_acid_100g mead_acid_100g erucic_acid_100g
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:1500 NA's:1500 NA's:1500 NA's:1500
##
##
##
##
##
## nervonic_acid_100g trans_fat_100g cholesterol_100g carbohydrates_100g
## Mode:logical Min. :0.0000 Min. :0.0000 Min. : 0.000
## NA's:1500 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 3.792
## Median :0.0000 Median :0.0000 Median : 13.500
## Mean :0.0105 Mean :0.0265 Mean : 27.958
## 3rd Qu.:0.0000 3rd Qu.:0.0026 3rd Qu.: 55.000
## Max. :0.1000 Max. :0.4300 Max. :100.000
## NA's :1481 NA's :1477 NA's :708
## sugars_100g sucrose_100g glucose_100g fructose_100g
## Min. : 0.00 Mode:logical Mode:logical Min. :100
## 1st Qu.: 1.00 NA's:1500 NA's:1500 1st Qu.:100
## Median : 4.05 Median :100
## Mean : 12.66 Mean :100
## 3rd Qu.: 14.70 3rd Qu.:100
## Max. :100.00 Max. :100
## NA's :788 NA's :1499
## lactose_100g maltose_100g maltodextrins_100g starch_100g
## Min. :0.000 Mode:logical Mode:logical Min. : 0.00
## 1st Qu.:0.250 NA's:1500 NA's:1500 1st Qu.: 9.45
## Median :0.500 Median :39.50
## Mean :2.933 Mean :30.73
## 3rd Qu.:4.400 3rd Qu.:42.85
## Max. :8.300 Max. :71.00
## NA's :1497 NA's :1493
## polyols_100g fiber_100g proteins_100g casein_100g
## Min. : 8.60 Min. : 0.000 Min. : 0.000 Min. :1.1
## 1st Qu.:59.10 1st Qu.: 0.500 1st Qu.: 1.500 1st Qu.:1.1
## Median :67.00 Median : 1.750 Median : 6.000 Median :1.1
## Mean :56.06 Mean : 2.823 Mean : 7.563 Mean :1.1
## 3rd Qu.:69.80 3rd Qu.: 3.500 3rd Qu.:10.675 3rd Qu.:1.1
## Max. :70.00 Max. :46.700 Max. :61.000 Max. :1.1
## NA's :1491 NA's :994 NA's :710 NA's :1499
## serum_proteins_100g nucleotides_100g salt_100g sodium_100g
## Mode:logical Mode:logical Min. : 0.0000 Min. : 0.0000
## NA's:1500 NA's:1500 1st Qu.: 0.0438 1st Qu.: 0.0172
## Median : 0.4498 Median : 0.1771
## Mean : 1.1205 Mean : 0.4409
## 3rd Qu.: 1.1938 3rd Qu.: 0.4700
## Max. :102.0000 Max. :40.0000
## NA's :780 NA's :780
## alcohol_100g vitamin_a_100g beta_carotene_100g vitamin_d_100g
## Min. : 0.00 Min. :0.0000 Mode:logical Min. :0e+00
## 1st Qu.: 0.00 1st Qu.:0.0000 NA's:1500 1st Qu.:0e+00
## Median : 5.50 Median :0.0001 Median :0e+00
## Mean :10.07 Mean :0.0003 Mean :0e+00
## 3rd Qu.:13.00 3rd Qu.:0.0006 3rd Qu.:0e+00
## Max. :50.00 Max. :0.0013 Max. :1e-04
## NA's :1433 NA's :1477 NA's :1485
## vitamin_e_100g vitamin_k_100g vitamin_c_100g vitamin_b1_100g
## Min. :0.0005 Min. :0 Min. :0.000 Min. :0.0001
## 1st Qu.:0.0021 1st Qu.:0 1st Qu.:0.002 1st Qu.:0.0003
## Median :0.0044 Median :0 Median :0.019 Median :0.0004
## Mean :0.0069 Mean :0 Mean :0.025 Mean :0.0006
## 3rd Qu.:0.0097 3rd Qu.:0 3rd Qu.:0.030 3rd Qu.:0.0010
## Max. :0.0320 Max. :0 Max. :0.217 Max. :0.0013
## NA's :1478 NA's :1498 NA's :1459 NA's :1478
## vitamin_b2_100g vitamin_pp_100g vitamin_b6_100g vitamin_b9_100g
## Min. :0.0002 Min. :0.0006 Min. :0.0001 Min. :0e+00
## 1st Qu.:0.0003 1st Qu.:0.0033 1st Qu.:0.0002 1st Qu.:0e+00
## Median :0.0009 Median :0.0069 Median :0.0008 Median :1e-04
## Mean :0.0011 Mean :0.0086 Mean :0.0112 Mean :1e-04
## 3rd Qu.:0.0013 3rd Qu.:0.0140 3rd Qu.:0.0012 3rd Qu.:2e-04
## Max. :0.0066 Max. :0.0160 Max. :0.2000 Max. :2e-04
## NA's :1483 NA's :1484 NA's :1481 NA's :1483
## vitamin_b12_100g biotin_100g pantothenic_acid_100g silica_100g
## Min. :0 Min. :0 Min. :0.0000 Min. :8e-04
## 1st Qu.:0 1st Qu.:0 1st Qu.:0.0007 1st Qu.:8e-04
## Median :0 Median :0 Median :0.0020 Median :8e-04
## Mean :0 Mean :0 Mean :0.0027 Mean :8e-04
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0.0051 3rd Qu.:8e-04
## Max. :0 Max. :0 Max. :0.0060 Max. :8e-04
## NA's :1489 NA's :1498 NA's :1486 NA's :1499
## bicarbonate_100g potassium_100g chloride_100g calcium_100g
## Min. :0.0006 Min. :0.0000 Min. :0.0003 Min. :0.0000
## 1st Qu.:0.0678 1st Qu.:0.0650 1st Qu.:0.0006 1st Qu.:0.0450
## Median :0.1350 Median :0.1940 Median :0.0009 Median :0.1200
## Mean :0.1692 Mean :0.3288 Mean :0.0144 Mean :0.2040
## 3rd Qu.:0.2535 3rd Qu.:0.3670 3rd Qu.:0.0214 3rd Qu.:0.1985
## Max. :0.3720 Max. :1.4300 Max. :0.0420 Max. :1.0000
## NA's :1497 NA's :1487 NA's :1497 NA's :1449
## phosphorus_100g iron_100g magnesium_100g zinc_100g
## Min. :0.0430 Min. :0.0000 Min. :0.0000 Min. :0.0005
## 1st Qu.:0.1938 1st Qu.:0.0012 1st Qu.:0.0670 1st Qu.:0.0009
## Median :0.3185 Median :0.0042 Median :0.1040 Median :0.0017
## Mean :0.3777 Mean :0.0045 Mean :0.1066 Mean :0.0016
## 3rd Qu.:0.4340 3rd Qu.:0.0077 3rd Qu.:0.1300 3rd Qu.:0.0022
## Max. :1.1550 Max. :0.0137 Max. :0.3330 Max. :0.0026
## NA's :1488 NA's :1463 NA's :1479 NA's :1493
## copper_100g manganese_100g fluoride_100g selenium_100g
## Min. :0e+00 Min. :0 Min. :0 Min. :0
## 1st Qu.:1e-04 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :1e-04 Median :0 Median :0 Median :0
## Mean :1e-04 Mean :0 Mean :0 Mean :0
## 3rd Qu.:1e-04 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :1e-04 Max. :0 Max. :0 Max. :0
## NA's :1498 NA's :1499 NA's :1498 NA's :1499
## chromium_100g molybdenum_100g iodine_100g caffeine_100g
## Mode:logical Mode:logical Min. :0 Mode:logical
## NA's:1500 NA's:1500 1st Qu.:0 NA's:1500
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
## NA's :1499
## taurine_100g ph_100g fruits_vegetables_nuts_100g
## Mode:logical Mode:logical Min. : 2.00
## NA's:1500 NA's:1500 1st Qu.:11.25
## Median :42.00
## Mean :36.88
## 3rd Qu.:52.25
## Max. :80.00
## NA's :1470
## collagen_meat_protein_ratio_100g cocoa_100g chlorophyl_100g
## Min. :12.00 Min. :30 Mode:logical
## 1st Qu.:13.50 1st Qu.:47 NA's:1500
## Median :15.00 Median :60
## Mean :15.67 Mean :57
## 3rd Qu.:17.50 3rd Qu.:70
## Max. :20.00 Max. :81
## NA's :1497 NA's :1491
## nutrition_score_fr_100g nutrition_score_uk_100g
## Min. :-12.000 Min. :-12.000
## 1st Qu.: 1.000 1st Qu.: 0.000
## Median : 7.000 Median : 6.000
## Mean : 7.941 Mean : 7.631
## 3rd Qu.: 15.000 3rd Qu.: 16.000
## Max. : 28.000 Max. : 28.000
## NA's :825 NA's :825
# Replace NAs with 0, then keep only the rows with sugars_100g > 0.
missing <- is.na(food3$sugars_100g)
food3$sugars_100g[missing] <- 0
food4 <- food3[(food3$sugars_100g > 0), ]
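# (The same NA replacement as a one-liner, a sketch assuming tidyr is loaded:
#  food3$sugars_100g <- replace_na(food3$sugars_100g, 0))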
# How many observations are packaged in plastic?
plastic <- str_detect(food3$packaging, "plasti")
sum(plastic)
## [1] 232
library(readxl)
att_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1294/datasets/attendance.xls"
# read_excel() cannot read directly from a URL, so download the file to a local drive first.
# The download below produced an unreadable file, likely because download.file() defaults to
# text mode on Windows; passing mode = "wb" downloads the binary .xls intact.
#download.file(att_url, file.path("C:/Users/mpfol/OneDrive/Documents/Data Science/Data", "att.xls"), mode = "wb")
att_path <- file.path("C:/Users/mpfol/OneDrive/Documents/Data Science/Data", "attendance.xls")
att <- read_excel(att_path, skip = 1)
Types of variables:
* Numeric (quantitative)
  * continuous
  * discrete
* Categorical (qualitative)
  * ordinal
  * nominal (a.k.a. categorical)
Identify data subsets with table(). Subset the data with dplyr::filter(). The subsetted data will still contain empty bins for the unused factor levels; if that is a problem, drop them with droplevels().
library(openintro)
## Warning: package 'openintro' was built under R version 3.4.4
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following objects are masked _by_ '.GlobalEnv':
##
## cars, iris
## The following object is masked from 'package:reshape2':
##
## tips
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
library(dplyr)
# Load data from the package. Note that with lazy-loading, it is almost never necessary to load package data this way.
data(hsb2)
# hsb2$schtyp has two levels, public and private. Filtering for just public does not change the level definition of hsb2$schtyp. Use droplevels to remove private.
table(hsb2$schtyp)
##
## public private
## 168 32
hsb2_public <- hsb2 %>%
filter(schtyp == "public")
table(hsb2_public$schtyp)
##
## public private
## 168 0
hsb2_public$schtyp <- droplevels(hsb2_public$schtyp)
table(hsb2_public$schtyp)
##
## public
## 168
Discretize a numeric variable into a categorical variable with ifelse() or case_when().
library(openintro)
library(dplyr)
data(hsb2)
# Parentheses around an assignment make it both assign and print the result.
(med_read <- median(hsb2$read))
## [1] 50
hsb2 <- hsb2 %>%
mutate(read_cat = ifelse(read < med_read, "< median",
ifelse(read == med_read, "median", "> median")))
hsb2 %>%
count(read_cat)
## # A tibble: 3 x 2
## read_cat n
## <chr> <int>
## 1 < median 83
## 2 > median 99
## 3 median 18
hsb2 <- hsb2 %>%
mutate(
race_white_oth = case_when(
race == "white" ~ "white",
race != "white" ~ "other"
)
)
Visualize numerical data with scatterplots.
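A minimal scatterplot sketch, assuming ggplot2 and the hsb2 data loaded above:
ggplot(hsb2, aes(x = read, y = write)) +
geom_point()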
Random sampling allows for generalization and applies to both observational studies and experiments. Random assignment allows for causal conclusions and applies only to experiments.
library(openintro)
library(dplyr)
data(hsb2)
hsb2 %>%
count(race, schtyp) %>%
group_by(race) %>%
mutate(prop = n / sum(n))
## # A tibble: 8 x 4
## # Groups: race [4]
## race schtyp n prop
## <chr> <fct> <int> <dbl>
## 1 african american public 18 0.900
## 2 african american private 2 0.100
## 3 asian public 10 0.909
## 4 asian private 1 0.0909
## 5 hispanic public 22 0.917
## 6 hispanic private 2 0.0833
## 7 white public 118 0.814
## 8 white private 27 0.186
Censuses are expensive, and the underlying population changes anyway. Instead we sample. Four common methods are simple random sampling, stratified sampling (group into strata first to guarantee equal representation), clustered sampling (cluster population, randomly choose clusters, then take census), and multi-stage sampling (cluster population, randomly choose clusters, then randomly sample).
library(openintro)
library(dplyr)
data(county)
county_noDC <- county %>%
filter(state != "District of Columbia") %>%
droplevels()
# simple random sample
county_srs <- county_noDC %>%
sample_n(size = 150)
county_srs %>%
group_by(state) %>%
count()
## # A tibble: 40 x 2
## # Groups: state [40]
## state n
## <fct> <int>
## 1 Alabama 4
## 2 Alaska 1
## 3 Arkansas 5
## 4 California 1
## 5 Colorado 3
## 6 Florida 3
## 7 Georgia 3
## 8 Hawaii 1
## 9 Idaho 3
## 10 Illinois 5
## # ... with 30 more rows
# stratified sample
county_ss <- county_noDC %>%
group_by(state) %>%
sample_n(size = 3)
county_ss %>%
group_by(state) %>%
count()
## # A tibble: 50 x 2
## # Groups: state [50]
## state n
## <fct> <int>
## 1 Alabama 3
## 2 Alaska 3
## 3 Arizona 3
## 4 Arkansas 3
## 5 California 3
## 6 Colorado 3
## 7 Connecticut 3
## 8 Delaware 3
## 9 Florida 3
## 10 Georgia 3
## # ... with 40 more rows
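The two samples above cover simple random and stratified sampling. Here is a minimal sketch of cluster sampling with the same county_noDC data: randomly choose a few states (the clusters), then take a census of their counties.
# cluster sample (sketch): 5 randomly chosen states, all of their counties
set.seed(123)
chosen_states <- sample(levels(county_noDC$state), size = 5)
county_cluster <- county_noDC %>%
filter(state %in% chosen_states)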
The principles of experimental design are control, randomize, replicate (use a sufficiently large sample), and block (control for confounding variables). Explanatory variables are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with and that you would like to control for. Control for a variable by stratifying in random sampling, and by blocking in random assignment.
This section covers graphical and numerical summaries of categorical variables. levels(x) provides access (display or write) to the levels of a factor variable. Another way to see the levels is with a table. Explore the relationship between two categorical variables with a contingency table or a stacked bar chart.
library(readr)
library(dplyr)
library(ggplot2)
comics <- read_csv("https://assets.datacamp.com/production/course_1796/datasets/comics.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## id = col_character(),
## align = col_character(),
## eye = col_character(),
## hair = col_character(),
## gender = col_character(),
## gsm = col_character(),
## alive = col_character(),
## appearances = col_integer(),
## first_appear = col_character(),
## publisher = col_character()
## )
# Drop underrepresented data from analysis
comics <- comics %>%
filter(align != "Reformed Criminals") %>%
droplevels()
# Convert character strings to factors
comics <- comics %>%
mutate(name = as.factor(name),
id = factor(id),
align = factor(align,
levels = c("Bad", "Neutral", "Good")), # sets order
eye = factor(eye),
hair = factor(hair),
gender = factor(gender),
alive = factor(alive),
first_appear = factor(first_appear),
publisher = factor(publisher)
)
# Levels of align, gender
levels(comics$align)
## [1] "Bad" "Neutral" "Good"
levels(comics$gender)
## [1] "Female" "Male" "Other"
# Table of counts (for proportions, see prop.table() below)
table(comics$align)
##
## Bad Neutral Good
## 9615 2773 7468
#margin.table(comics$align)
# Contingency table or proportional table
table(comics$align, comics$gender)
##
## Female Male Other
## Bad 1573 7561 32
## Neutral 836 1799 17
## Good 2490 4809 17
options(scipen = 999, digits = 3) # sig digits
prop.table(table(comics$align, comics$gender))
##
## Female Male Other
## Bad 0.082210 0.395160 0.001672
## Neutral 0.043692 0.094021 0.000888
## Good 0.130135 0.251333 0.000888
# Conditional proportion. Condition on rows (margin = 1), or cols (margin = 2).
prop.table(table(comics$align, comics$gender), margin = 1)
##
## Female Male Other
## Bad 0.17161 0.82490 0.00349
## Neutral 0.31523 0.67836 0.00641
## Good 0.34035 0.65733 0.00232
# Marginal bar-plot of gender counts
ggplot(comics, aes(x = gender)) +
geom_bar()
# Improve by ordering
comics$gender = factor(comics$gender,
levels = c("Male", "Female", "Other"))
ggplot(comics, aes(x = gender)) +
geom_bar()
# Conditional stacked bar-plot of align counts conditioned on gender
ggplot(comics, aes(x = gender, fill = align)) +
geom_bar()
# Conditional stacked bar-chart of align proportions conditioned on gender
ggplot(comics, aes(x = gender, fill = align)) +
geom_bar(position = "fill") +
ylab("proportion")
# Conditional faceted bar-chart of align counts conditioned on gender
ggplot(comics, aes(x = align)) +
geom_bar() +
facet_wrap(~ gender)
# Same, but as side-by-side barchart
ggplot(comics, aes(x = gender, fill = align)) +
geom_bar(position = "dodge")
# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 90)) # vertical x-axis labels
For univariate exploration, visualize discrete numerical data with limited distinct values using geom_dotplot(). Otherwise, use geom_histogram() or a density plot, geom_density(). geom_boxplot() is also possible, but it is usually used in bivariate exploration.
library(readr)
library(ggplot2)
cars <- read_csv("https://assets.datacamp.com/production/course_1796/datasets/cars04.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## sports_car = col_logical(),
## suv = col_logical(),
## wagon = col_logical(),
## minivan = col_logical(),
## pickup = col_logical(),
## all_wheel = col_logical(),
## rear_wheel = col_logical(),
## msrp = col_integer(),
## dealer_cost = col_integer(),
## eng_size = col_double(),
## ncyl = col_integer(),
## horsepwr = col_integer(),
## city_mpg = col_integer(),
## hwy_mpg = col_integer(),
## weight = col_integer(),
## wheel_base = col_integer(),
## length = col_integer(),
## width = col_integer()
## )
# Dotplot for discrete numerical variable with limited distinct values
ggplot(cars, aes(x = weight)) +
geom_dotplot(dotsize = 0.4) +
labs(title = "Car Weight Distribution",
subtitle = "Dot plots can get unwieldy, but show the most detail.")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bindot).
# Histogram for any numerical data with many distinct values
ggplot(cars, aes(x = weight)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
For bivariate analysis of a numerical variable with a factor variable, use geom_boxplot() or overlaid geom_density() plots. Both density plots and box plots display the central tendency and spread of the data. The box plot is more robust to outliers; the density plot reveals multi-modal distributions.
Add a third dimension to plots with facet_grid(rowvar ~ colvar) or by mapping the variable to shape, color, size, pattern, movement, x-coord, or y-coord.
library(readr)
library(dplyr)
library(ggplot2)
cars <- read_csv("https://assets.datacamp.com/production/course_1796/datasets/cars04.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## sports_car = col_logical(),
## suv = col_logical(),
## wagon = col_logical(),
## minivan = col_logical(),
## pickup = col_logical(),
## all_wheel = col_logical(),
## rear_wheel = col_logical(),
## msrp = col_integer(),
## dealer_cost = col_integer(),
## eng_size = col_double(),
## ncyl = col_integer(),
## horsepwr = col_integer(),
## city_mpg = col_integer(),
## hwy_mpg = col_integer(),
## weight = col_integer(),
## wheel_base = col_integer(),
## length = col_integer(),
## width = col_integer()
## )
# Box plots of city mpg by ncyl.
# (note: set aes(x = 1) to create a single boxplot)
ggplot(cars, aes(x = as.factor(ncyl), y = city_mpg)) +
geom_boxplot()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
# Overlaid density plots for same data
ggplot(cars, aes(x = city_mpg, fill = as.factor(ncyl))) +
geom_density(alpha = .3)
## Warning: Removed 14 rows containing non-finite values (stat_density).
## Warning: Groups with fewer than two data points have been dropped.
# Use piping with a filter to single out interesting subsets of numerical data.
cars %>%
filter(msrp < 25000) %>%
ggplot(aes(x = horsepwr)) +
geom_histogram(binwidth = 3) +
xlim(c(90, 550)) +
labs(title = "Distribution of horsepower for cars under $25k (bw = 3)")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
# In this case, the box plot works better because of the wide range of outliers
cars %>%
ggplot(aes(x = city_mpg)) +
geom_density()
## Warning: Removed 14 rows containing non-finite values (stat_density).
cars %>%
ggplot(aes(x = 1, y = city_mpg)) +
geom_boxplot()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
# In this case, the density plot works better because of the three modes.
cars %>%
ggplot(aes(x = width)) +
geom_density()
## Warning: Removed 28 rows containing non-finite values (stat_density).
cars %>%
ggplot(aes(x = 1, y = width)) +
geom_boxplot()
## Warning: Removed 28 rows containing non-finite values (stat_boxplot).
# Faceted histograms of hwy_mpg by ncyl and suv
cars %>%
filter(ncyl %in% c(2, 4, 6)) %>%
ggplot(aes(x = hwy_mpg)) +
geom_histogram() +
facet_grid(ncyl ~ suv, labeller = label_both) +
ggtitle("hwy_mpg distribution, ncyl vs suv")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7 rows containing non-finite values (stat_bin).
Measures of center include mean, median, and mode. Measures of variability include var(x), sd(x), IQR(x), and diff(range(x)). The median and IQR are useful for data sets that are heavily skewed or have extreme observations; report median + IQR there, and mean + sd otherwise. Measures of shape include modality (uniform, unimodal, bimodal, multimodal) and skew (symmetric, right-skewed, left-skewed). Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure; transform skewed data with the natural logarithm function log().
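A toy sketch of these measures on a small right-skewed vector (values invented for illustration):
x <- c(1, 2, 2, 3, 4, 50) # right-skewed toy data
mean(x); median(x) # the outlier pulls the mean far above the median
sd(x); IQR(x); diff(range(x)) # sd is inflated by the outlier; IQR is not
log(x) # the log transform compresses the long right tail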
library(readr)
life <- read_csv("https://assets.datacamp.com/production/course_1796/datasets/life_exp_raw.csv")
## Parsed with column specification:
## cols(
## State = col_character(),
## County = col_character(),
## fips = col_integer(),
## Year = col_integer(),
## `Female life expectancy (years)` = col_double(),
## `Female life expectancy (national, years)` = col_double(),
## `Female life expectancy (state, years)` = col_double(),
## `Male life expectancy (years)` = col_double(),
## `Male life expectancy (national, years)` = col_double(),
## `Male life expectancy (state, years)` = col_double()
## )
life$expectancy_f <- life$`Female life expectancy (years)`
life$`Female life expectancy (years)` <- NULL
life$expect_f_nat <- life$`Female life expectancy (national, years)`
life$'Female life expectancy (national, years)' <- NULL
life$expect_f_st <- life$`Female life expectancy (state, years)`
life$'Female life expectancy (state, years)' <- NULL
life$expect_m <- life$`Male life expectancy (years)`
life$'Male life expectancy (years)' <- NULL
life$expect_m_nat <- life$`Male life expectancy (national, years)`
life$'Male life expectancy (national, years)' <- NULL
life$expect_m_st <- life$`Male life expectancy (state, years)`
life$'Male life expectancy (state, years)' <- NULL
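# (The dozen assignments above could be written more compactly with dplyr::rename(), a sketch:
#  life <- life %>% rename(expectancy_f = `Female life expectancy (years)`), and so on.)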
# Do west-coast states have a higher life expectancy?
life <- life %>%
mutate(west_coast = State %in% c("California", "Oregon", "Washington"))
life %>%
group_by(west_coast) %>%
summarize(mean(expect_m),
median(expect_m))
## # A tibble: 2 x 3
## west_coast `mean(expect_m)` `median(expect_m)`
## <lgl> <dbl> <dbl>
## 1 FALSE 72.6 72.8
## 2 TRUE 74.4 74.3
ggplot(life, aes(x = west_coast, y = expect_m)) +
geom_boxplot() +
labs(title = "Male Life Expectancy Distribution",
x = "West Coast State")
ggplot(life, aes(x = expect_m, fill = west_coast)) +
geom_density(alpha = 0.3) +
labs(title = "Male Life Expectancy Distribution")
What attributes of an email are associated with spam?
library(openintro)
library(ggplot2)
library(dplyr)
email2 <- email %>%
mutate(spam = factor(ifelse(spam == 1, "spam", "not-spam")))
# Is it size?
email2 %>%
group_by(spam) %>%
summarize(
# mean(num_char),
# sd(num_char),
median(num_char),
IQR(num_char)
)
## # A tibble: 2 x 3
## spam `median(num_char)` `IQR(num_char)`
## <fct> <dbl> <dbl>
## 1 not-spam 6.83 13.6
## 2 spam 1.05 2.82
email2 %>%
mutate(log_num_char = log(num_char)) %>%
ggplot(aes(x = spam, y = log_num_char)) +
geom_boxplot()
# Conclusions:
# The typical spam email is considerably shorter, but there is still a lot of overlap.
# Is it the number of exclamation marks?
# If not unimodal, use a faceted histogram or density plot; otherwise use a boxplot. Appropriate summary stats are median and IQR to accompany a boxplot, and mean and sd otherwise.
email2 %>%
group_by(spam) %>%
summarize(median(exclaim_mess),
IQR(exclaim_mess),
mean(exclaim_mess),
sd(exclaim_mess))
## # A tibble: 2 x 5
## spam `median(exclaim_mess)` `IQR(exclaim_mess)` `mean(exclaim_mess)`
## <fct> <dbl> <dbl> <dbl>
## 1 not-spam 1. 5. 6.51
## 2 spam 0. 1. 7.32
## # ... with 1 more variable: `sd(exclaim_mess)` <dbl>
# Create plot for spam and exclaim_mess
ggplot(email2, aes(x = spam, y = log(exclaim_mess + 0.1))) +
geom_boxplot()
ggplot(email2, aes(x = log(exclaim_mess+.01), fill = spam)) +
geom_density(alpha = 0.6)
# Histogram seems most helpful
email2 %>%
mutate(log_exclaim_mess = log(exclaim_mess+0.01)) %>%
ggplot(aes(x = log_exclaim_mess)) +
geom_histogram() +
facet_wrap(~ spam)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Conclusions:
# Most common is 0 or 1 exclamation marks in both classes of email.
# Even after transformation, the distribution is right-skewed in both classes of email.
# The typical number of exclamations in the not-spam group appears to be slightly higher than in the spam group.
# What to do when there are so many emails with zero exclamation marks? One strategy is to analyze the zeros separately. Another is to collapse the count into a two-level categorical variable, as sketched below.
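# A sketch of that two-level collapse for exclaim_mess (zero vs. at least one):
email2 %>%
mutate(has_exclaim = exclaim_mess > 0) %>%
ggplot(aes(x = has_exclaim, fill = spam)) +
geom_bar(position = "fill")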
# How about the number of images? There are 3,811 instances of no images and just a few with >= 1. Collapse the image variable into a logical.
table(email2$image)
##
## 0 1 2 3 4 5 9 20
## 3811 76 17 11 2 2 1 1
email2 %>%
mutate(has_image = (image > 0)) %>%
ggplot(aes(x = has_image, fill = spam)) +
geom_bar(position = "fill")
# Check data integrity.
# There should be no instances of more images than attachments if images are a form of attachment.
sum(email$image > email$attach)
## [1] 0
UN Voting Dataset.
library(dplyr)
library(broom)
library(tidyr)
library(purrr)
#library(countrycode)
votes <- readRDS("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Data/votes.rds")
glimpse(votes)
## Observations: 508,929
## Variables: 4
## $ rcid <dbl> 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46...
## $ session <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ vote <dbl> 1, 1, 9, 1, 1, 1, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 9, 1, ...
## $ ccode <int> 2, 20, 31, 40, 41, 42, 51, 52, 53, 54, 55, 56, 57, 58,...
descriptions <- readRDS("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Data/descriptions.rds")
# There are six columns in the descriptions dataset that describe the topic of a resolution:
# me: Palestinian conflict
# nu: Nuclear weapons and nuclear material
# di: Arms control and disarmament
# hr: Human rights
# co: Colonialism
# ec: Economic development
# The vote column represents the country's vote:
# 1 = Yes
# 2 = Abstain
# 3 = No
# 8 = Not present
# 9 = Not a member
# Get rid of "Not present" and "Not a member".
# The first session was 1946, so create column year = session + 1945.
votes_processed <- votes %>%
filter(vote %in% c(1, 2, 3)) %>%
mutate(year = session + 1945)
# Join data sets
votes_joined <- inner_join(votes_processed, descriptions, c("rcid", "session"))
## Warning: Column `rcid` has different attributes on LHS and RHS of join
## Warning: Column `session` has different attributes on LHS and RHS of join
votes_gathered <- votes_joined %>%
gather(key = topic, value = has_topic, c(me:ec)) %>%
filter(has_topic != 0)
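# (In tidyr >= 1.0, gather() is superseded; a sketch of the same reshape:
#  pivot_longer(votes_joined, me:ec, names_to = "topic", values_to = "has_topic"))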
votes_tidied <- votes_gathered %>%
mutate(topic = recode(topic,
"me" = "Palestinian conflict",
"nu" = "Nuclear weapons and nuclear material",
"di" = "Arms control and disarmament",
"hr" = "Human rights",
"co" = "Colonialism",
"ec" = "Economic development"))
by_country_year_topic <- votes_tidied %>%
group_by(ccode, year, topic) %>%
summarize(total = n(),
percent_yes = mean(vote == 1)) %>%
ungroup()
US_by_country_year_topic <- by_country_year_topic %>%
# filter(country == "United States") would work if country names were attached via the countrycode package
filter(ccode == 2) # COW country code 2 = United States
# Plot % yes over time for the US, faceting by topic
ggplot(US_by_country_year_topic, aes(x = year, y = percent_yes)) +
geom_line() +
facet_wrap(~ topic)
# Create a model by country:
# nest all columns other than the keys into tibbles,
# map lm() over each nested tibble,
# use broom::tidy() to turn each model summary into a tibble,
# then unnest the results back into the main data frame.
country_topic_coefficients <- by_country_year_topic %>%
nest(-ccode, -topic) %>%
mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
tidied = map(model, tidy)) %>%
unnest(tidied)
## Warning in summary.lm(x): essentially perfect fit: summary may be
## unreliable
## (warning repeated 14 times in total)
country_topic_filtered <- country_topic_coefficients %>%
filter(term == "year") %>%
mutate(p.adjusted = p.adjust(p.value)) %>%
filter(p.adjusted < 0.05)
vanuatu_by_country_year_topic <- by_country_year_topic %>%
filter(ccode == 935) # COW country code 935 = Vanuatu
# Plot of percentage "yes" over time, faceted by topic
ggplot(vanuatu_by_country_year_topic, aes(x = year, y = percent_yes)) +
geom_line() +
facet_wrap(~ topic)
US_co_by_year <- votes_joined %>%
filter(ccode == 2, co == 1) %>%
group_by(year) %>%
summarize(percent_yes = mean(vote == 1))
# Graph the % of "yes" votes over time
ggplot(US_co_by_year, aes(x = year, y = percent_yes)) +
geom_line()
by_year_country <- votes_joined %>%
group_by(ccode, year) %>%
summarize(total = n(),
percent_yes = mean(vote == 1))
US_by_year <- by_year_country %>%
filter(ccode == 2)
US_fit <- lm(percent_yes ~ year, US_by_year)
# Fit model for a second country, Canada (COW country code 20)
CA_by_year <- by_year_country %>%
filter(ccode == 20)
CA_fit <- lm(percent_yes ~ year, CA_by_year)
US_tidied <- tidy(US_fit)
CA_tidied <- tidy(CA_fit)
# Combine the two tidied models
bind_rows(US_tidied, CA_tidied)
## term estimate std.error statistic p.value
## 1 (Intercept) 12.66415 1.837974 6.89 0.0000000848
## 2 year -0.00624 0.000928 -6.72 0.0000001367
## 3 (Intercept) -2.48418 1.891412 -1.31 0.1983885448
## 4 year 0.00152 0.000955 1.59 0.1223589541
Right now, the by_year_country data frame has one row per country-year pair. So that you can model each country individually, "nest" all columns besides ccode, which results in a data frame with one row per country. The data for each individual country is then stored in a list column called data.
library(tidyr)
library(purrr)
country_coefficients <- by_year_country %>%
nest(-ccode) %>%
mutate(models = map(data, ~ lm(percent_yes ~ year, .))) %>%
mutate(tidied = map(models, tidy)) %>%
unnest(tidied)
# When you have lots of p-values, like one for each country, you run into the
# multiple hypothesis testing problem: some p-values will be small by chance
# alone, so a stricter threshold is needed. The p.adjust() function is a simple
# correction: p.adjust(p.value) on a vector of p-values returns an adjusted set
# that you can trust.
country_coefficients %>%
filter(term == "year") %>%
filter(p.adjust(p.value) < 0.05)
## # A tibble: 61 x 6
## ccode term estimate std.error statistic p.value
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2 year -0.00624 0.000928 -6.72 1.37e- 7
## 2 40 year 0.00461 0.000721 6.40 3.43e- 7
## 3 41 year 0.00538 0.000699 7.70 8.82e- 9
## 4 42 year 0.00806 0.000914 8.81 5.96e-10
## 5 70 year 0.00530 0.000884 6.00 1.08e- 6
## 6 90 year 0.00585 0.00104 5.62 3.27e- 6
## 7 91 year 0.00772 0.000921 8.38 1.43e- 9
## 8 92 year 0.00614 0.000851 7.22 3.38e- 8
## 9 93 year 0.00708 0.00107 6.60 1.92e- 7
## 10 94 year 0.00654 0.000812 8.05 3.39e- 9
## # ... with 51 more rows
Statistical inference is the process of making claims about a population based on information from a sample. Inferential statistics attempts to reject a null hypothesis, \(H_0\).
The logic of statistical inference is to compare an observed statistic to the distribution of statistics arising from the null distribution.
Here is a brief exploration of the data used in this section.
library(NHANES)
## Warning: package 'NHANES' was built under R version 3.4.4
library(ggplot2)
library(dplyr)
glimpse(NHANES)
## Observations: 10,000
## Variables: 76
## $ ID <int> 51624, 51624, 51624, 51625, 51630, 51638, 516...
## $ SurveyYr <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, ...
## $ Gender <fct> male, male, male, male, female, male, male, f...
## $ Age <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, ...
## $ AgeDecade <fct> 30-39, 30-39, 30-39, 0-9, 40-49, 0-9, ...
## $ AgeMonths <int> 409, 409, 409, 49, 596, 115, 101, 541, 541, 5...
## $ Race1 <fct> White, White, White, Other, White, White, Whi...
## $ Race3 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Education <fct> High School, High School, High School, NA, So...
## $ MaritalStatus <fct> Married, Married, Married, NA, LivePartner, N...
## $ HHIncome <fct> 25000-34999, 25000-34999, 25000-34999, 20000-...
## $ HHIncomeMid <int> 30000, 30000, 30000, 22500, 40000, 87500, 600...
## $ Poverty <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33, 5.0...
## $ HomeRooms <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 10, 1...
## $ HomeOwn <fct> Own, Own, Own, Own, Rent, Rent, Own, Own, Own...
## $ Work <fct> NotWorking, NotWorking, NotWorking, NA, NotWo...
## $ Weight <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2, 75....
## $ Length <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ HeadCirc <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Height <dbl> 165, 165, 165, 105, 168, 133, 131, 167, 167, ...
## $ BMI <dbl> 32.2, 32.2, 32.2, 15.3, 30.6, 16.8, 20.6, 27....
## $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ BMI_WHO <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.5, 3...
## $ Pulse <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 60, 6...
## $ BPSysAve <int> 113, 113, 113, NA, 112, 86, 107, 118, 118, 11...
## $ BPDiaAve <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 63, 7...
## $ BPSys1 <int> 114, 114, 114, NA, 118, 84, 114, 106, 106, 10...
## $ BPDia1 <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 64, 7...
## $ BPSys2 <int> 114, 114, 114, NA, 108, 84, 108, 118, 118, 11...
## $ BPDia2 <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 62, 7...
## $ BPSys3 <int> 112, 112, 112, NA, 116, 88, 106, 118, 118, 11...
## $ BPDia3 <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 64, 7...
## $ Testosterone <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ DirectChol <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2.12,...
## $ TotChol <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5.82,...
## $ UrineVol1 <int> 352, 352, 352, NA, 77, 123, 238, 106, 106, 10...
## $ UrineFlow1 <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.116, 1...
## $ UrineVol2 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ UrineFlow2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Diabetes <fct> No, No, No, No, No, No, No, No, No, No, No, N...
## $ DiabetesAge <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ HealthGen <fct> Good, Good, Good, NA, Good, NA, NA, Vgood, Vg...
## $ DaysPhysHlthBad <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4, NA...
## $ DaysMentHlthBad <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0, 0,...
## $ LittleInterest <fct> Most, Most, Most, NA, Several, NA, NA, None, ...
## $ Depressed <fct> Several, Several, Several, NA, Several, NA, N...
## $ nPregnancies <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, NA, N...
## $ nBabies <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA...
## $ Age1stBaby <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, N...
## $ SleepHrsNight <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA,...
## $ SleepTrouble <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, N...
## $ PhysActive <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes, Ye...
## $ PhysActiveDays <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5, 1,...
## $ TVHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ CompHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ TVHrsDayChild <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, NA, ...
## $ CompHrsDayChild <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, NA, ...
## $ Alcohol12PlusYr <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes...
## $ AlcoholDay <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2, 6, ...
## $ AlcoholYear <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100, 104...
## $ SmokeNow <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, No, ...
## $ Smoke100 <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, Y...
## $ Smoke100n <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, NA, N...
## $ SmokeAge <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 13, N...
## $ Marijuana <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes...
## $ AgeFirstMarij <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, NA, 1...
## $ RegularMarij <fct> No, No, No, NA, No, NA, NA, No, No, No, NA, Y...
## $ AgeRegMarij <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2...
## $ HardDrugs <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, N...
## $ SexEver <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes...
## $ SexAge <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 17, 2...
## $ SexNumPartnLife <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, 7, 1...
## $ SexNumPartYear <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1, NA...
## $ SameSex <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Yes, N...
## $ SexOrientation <fct> Heterosexual, Heterosexual, Heterosexual, NA,...
## $ PregnantNow <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
# Bar plot of Home Ownership by Gender
ggplot(NHANES, aes(x = Gender, fill = HomeOwn)) +
geom_bar(position = "fill") +
ylab("Relative frequencies")
# Density plot of SleepHrsNight colored by SleepTrouble
ggplot(NHANES, aes(x = SleepHrsNight, color = SleepTrouble)) +
geom_density(adjust = 2) +
facet_wrap(~ HealthGen)
## Warning: Removed 2245 rows containing non-finite values (stat_density).
Calculate the difference in home ownership proportions for males and females. The observed statistic, male proportion minus female proportion, is -0.0078.
library(infer)
## Warning: package 'infer' was built under R version 3.4.4
homes <- NHANES %>%
select(Gender, HomeOwn) %>%
filter(HomeOwn %in% c("Own", "Rent"))
diff_orig <- homes %>%
group_by(Gender) %>%
summarize(prop_own = mean(HomeOwn == "Own")) %>%
summarize(obs_diff_prop = diff(prop_own)) # male - female
diff_orig
## # A tibble: 1 x 1
## obs_diff_prop
## <dbl>
## 1 -0.00783
Model natural variability (the null distribution) by shuffling observations to remove any relationships that might exist in the population.
Use the infer package to model \(H_0\) of no relationship between HomeOwn and Gender. Randomize the data to calculate permuted statistics.
1. Specify the model with specify(), defining the success condition for the proportion.
2. Set the null hypothesis with hypothesize().
3. Generate reps permutations of the data with generate().
4. Calculate summary statistics with calculate().
This process ensures that there is no relationship between home ownership and gender, so any difference in home ownership proportion is due only to natural variability.
homeown_perm <- homes %>%
specify(HomeOwn ~ Gender, success = "Own") %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in props",
order = c("male", "female"))
ggplot(homeown_perm, aes(x = stat)) +
# geom_dotplot(binwidth = 0.001) +
geom_density()
The observed difference of -0.0078 falls below the bulk of the density of shuffled differences. How many randomly permuted differences were as extreme as the observed difference?
homeown_perm <- homeown_perm %>%
mutate(diff_perm = stat) %>%
mutate(stat = NULL)
homeown_perm$diff_orig <- rep(diff_orig[[1,1]], nrow(homeown_perm))
# Plot permuted differences, diff_perm
ggplot(homeown_perm, aes(x = diff_perm)) +
geom_density() +
geom_vline(aes(xintercept = diff_orig), color = "red")
homeown_perm %>%
summarize(sum(diff_perm <= diff_orig))
## # A tibble: 1 x 1
## `sum(diff_perm <= diff_orig)`
## <int>
## 1 220
220 of the 1,000 permutations produced a difference in proportions at least as extreme as the measured value, so do not reject \(H_0\). Our data is consistent with the hypothesis of no difference in home ownership across gender.
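Equivalently, divide that count by the number of permutations to get an empirical one-sided p-value of about 0.22. A minimal sketch of that computation, reusing the homeown_perm data frame built above:
# Empirical one-sided p-value: share of permuted differences at least as
# extreme as the observed difference (220 / 1000 = 0.22)
homeown_perm %>%
summarize(p_value = mean(diff_perm <= diff_orig))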
Consider the study on gender discrimination. Our hypotheses are: \(H_0\): gender and promotion are unrelated variables. \(H_A\): men are more likely to be promoted.
Start with exploratory analysis
library(dplyr)
library(infer)
disc <- readRDS("Data/disc_new.rds")
glimpse(disc)
## Observations: 48
## Variables: 2
## $ promote <fct> promoted, promoted, promoted, promoted, promoted, prom...
## $ sex <fct> male, male, male, male, male, male, male, male, male, ...
# Counts and proportions (as shown in course)
#disc %>%
# count(promote, sex)
#disc %>%
# group_by(sex) %>%
# summarize(promoted_prop = mean(promote == "promoted"))
# Better way: Contingency table
table(disc$promote, disc$sex)
##
## female male
## not_promoted 7 6
## promoted 17 18
options(scipen = 999, digits = 3) # sig digits
# marginal proportion (margin = 2 for cols)
prop.table(table(disc$promote, disc$sex), margin = 2)
##
## female male
## not_promoted 0.292 0.250
## promoted 0.708 0.750
# Calculate difference in proportions, male - female
diff_orig <- disc %>%
group_by(sex) %>%
summarize(prop_prom = mean(promote == "promoted")) %>%
summarize(stat = diff(prop_prom)) %>%
pull()
Perform the permutation. To quantify the extreme permuted (null) differences, use the quantile() function. The p-value is the probability of observing data as or more extreme given that the null hypothesis is true.
library(ggplot2)
# Replicate the data frame, permuting the promote variable
disc_perm <- disc %>%
specify(promote ~ sex, success = "promoted") %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in props", order = c("male", "female"))
ggplot(disc_perm, aes(x = stat)) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = diff_orig), color = "red")
disc_perm %>%
summarize(
q.01 = quantile(stat, p = 0.01),
q.05 = quantile(stat, p = 0.05),
q.10 = quantile(stat, p = 0.10),
q.90 = quantile(stat, p = 0.90),
q.95 = quantile(stat, p = 0.95),
q.99 = quantile(stat, p = 0.99)
)
## # A tibble: 1 x 6
## q.01 q.05 q.10 q.90 q.95 q.99
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.292 -0.208 -0.125 0.208 0.208 0.292
disc_perm %>%
visualize() +
shade_p_value(obs_stat = diff_orig, direction = "greater")
disc_perm %>%
get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.518
disc_perm %>%
summarize(p_value = mean(diff_orig <= stat))
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.518
Question: How much confidence in the scientific community did people have in 2016? The answers to this question have been summarized as "High" or "Low" levels of confidence and are stored in the consci variable.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.4
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## (package startup and function-masking conflict messages omitted)
library(dplyr)
library(ggplot2)
load("Data/gss.RData")
glimpse(gss)
## Observations: 50,346
## Variables: 28
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ year <dbl> 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1982,...
## $ age <fct> 41, 49, 27, 24, 57, 29, 21, 68, 54, 80, 74, 30, 53, 3...
## $ class <fct> WORKING CLASS, WORKING CLASS, MIDDLE CLASS, MIDDLE CL...
## $ degree <fct> LT HIGH SCHOOL, HIGH SCHOOL, HIGH SCHOOL, HIGH SCHOOL...
## $ sex <fct> MALE, FEMALE, FEMALE, FEMALE, MALE, MALE, FEMALE, MAL...
## $ marital <fct> MARRIED, MARRIED, NEVER MARRIED, NEVER MARRIED, NEVER...
## $ race <fct> WHITE, WHITE, WHITE, WHITE, WHITE, WHITE, WHITE, WHIT...
## $ region <fct> NEW ENGLAND, NEW ENGLAND, NEW ENGLAND, NEW ENGLAND, N...
## $ partyid <fct> STRONG DEMOCRAT, STRONG DEMOCRAT, IND,NEAR DEM, IND,N...
## $ happy <fct> PRETTY HAPPY, NOT TOO HAPPY, VERY HAPPY, PRETTY HAPPY...
## $ grass <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ relig <fct> CATHOLIC, CATHOLIC, CATHOLIC, CATHOLIC, CATHOLIC, CAT...
## $ cappun2 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ cappun <fct> FAVOR, FAVOR, FAVOR, OPPOSE, OPPOSE, FAVOR, OPPOSE, F...
## $ finalter <fct> STAYED SAME, WORSE, BETTER, BETTER, STAYED SAME, BETT...
## $ protest3 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ natspac <fct> ABOUT RIGHT, TOO MUCH, TOO LITTLE, TOO LITTLE, ABOUT ...
## $ natarms <fct> TOO LITTLE, TOO LITTLE, ABOUT RIGHT, TOO MUCH, TOO LI...
## $ conclerg <fct> ONLY SOME, ONLY SOME, A GREAT DEAL, ONLY SOME, A GREA...
## $ confed <fct> ONLY SOME, ONLY SOME, ONLY SOME, ONLY SOME, A GREAT D...
## $ conpress <fct> ONLY SOME, ONLY SOME, A GREAT DEAL, ONLY SOME, A GREA...
## $ conjudge <fct> HARDLY ANY, ONLY SOME, A GREAT DEAL, A GREAT DEAL, A ...
## $ consci <fct> ONLY SOME, ONLY SOME, A GREAT DEAL, A GREAT DEAL, A G...
## $ conlegis <fct> ONLY SOME, ONLY SOME, ONLY SOME, ONLY SOME, A GREAT D...
## $ zodiac <fct> TAURUS, CAPRICORN, VIRGO, PISCES, CAPRICORN, LEO, LIB...
## $ oversamp <dbl> 1.24, 1.24, 1.24, 1.24, 1.24, 1.24, 1.24, 1.24, 1.24,...
## $ postlife <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
# Collapse levels
# from "A GREAT DEAL" "ONLY SOME" "HARDLY ANY"
# to "High", "Low"
levels(gss$consci)
## [1] "A GREAT DEAL" "ONLY SOME" "HARDLY ANY"
levels(gss$consci) <- c("High", "Low", "Low")
gss2016 <- gss %>%
filter(year == 2016)
ggplot(gss2016, aes(x = consci)) +
geom_bar()
# proportion of high conf.
p_hat <- gss2016 %>%
summarize(p = mean(consci == "High", na.rm = TRUE))
Calculate the standard error. To assess uncertainty in this estimate of the proportion of people who have "High" confidence in the scientific community, start by considering how different the data might look in just a single bootstrap sample.
library(infer)
# Create single bootstrap data set
b1 <- gss2016 %>%
specify(response = consci, success = "High") %>%
generate(reps = 1, type = "bootstrap")
## Warning: Removed 983 rows containing missing values.
# Plot distribution of consci
ggplot(b1, aes(x = consci)) +
geom_bar()
# Compute proportion with high conf
b1 %>%
summarize(p = mean(consci == "High"))
## # A tibble: 1 x 2
## replicate p
## <int> <dbl>
## 1 1 0.439
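A single bootstrap sample only hints at the variability. To estimate the standard error itself, repeat the resampling many times and take the standard deviation of the bootstrap statistics. A sketch using the same infer verbs (reps = 500 is an arbitrary choice):
# Bootstrap distribution of the proportion with "High" confidence
boot_dist <- gss2016 %>%
specify(response = consci, success = "High") %>%
generate(reps = 500, type = "bootstrap") %>%
calculate(stat = "prop")
# The standard error is the standard deviation of the bootstrap statistics
boot_dist %>%
summarize(se = sd(stat))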
Visualize bivariate continuous relationships with a scatterplot, or discretize one variable and create boxplots. The goal of bivariate analysis is to characterize the form (linear, quadratic, nonlinear), direction (positive or negative), strength (amount of scatter), and outliers.
library(openintro)
library(dplyr)
library(ggplot2)
data(ncbirths)
ggplot(ncbirths, aes(x = weeks, y = weight)) +
geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(data = ncbirths,
aes(x = cut(weeks, breaks = 5), y = weight)) +
geom_boxplot()
Use a data transformation to create a linear relationship where the raw data does not reveal one. When values span a wide range, transform with log.
# Original
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point()
# Scatterplot with coord_trans()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point() +
coord_trans(x = "log10", y = "log10")
# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
geom_point() +
scale_x_log10() + scale_y_log10()
Use alpha shading and/or jittering to address overplotting of integer variables. Identify outliers and note how the relationship between two variables may change as a result of removing them. Be careful with rate statistics since they mask the underlying number of observations.
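For instance, here is a sketch of jittering with alpha shading on two integer-valued columns of ncbirths (assuming the father's-age fage and mother's-age mage columns):
# Jittering and alpha shading reveal point density that plain geom_point()
# hides when both variables are integers
ggplot(ncbirths, aes(x = fage, y = mage)) +
geom_jitter(alpha = 0.3, width = 0.3, height = 0.3)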
Correlation is a numeric measure of the degree of linear relationship between two variables. It is defined as
\(r(x,y) = \frac{Cov(x,y)}{\sqrt{SXX \cdot SYY}}\)
The cor(x, y) function computes the Pearson product-moment correlation between variables x and y. Specify use = "pairwise.complete.obs" when columns may include missing data.
ncbirths %>%
summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))
## N r
## 1 1000 0.67
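As a quick check (not part of the original analysis), the definition can be reproduced by hand: dividing the covariance by the product of standard deviations is equivalent to dividing by \(\sqrt{SXX \cdot SYY}\).
# The manual ratio reproduces cor() on complete cases
ncbirths %>%
filter(!is.na(weight), !is.na(weeks)) %>%
summarize(r_manual = cov(weight, weeks) / (sd(weight) * sd(weeks)),
r_cor = cor(weight, weeks))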
The simple linear regression model minimizes the sum of squared residuals. Linear regression is a specific example of a larger class of smooth models. The geom_smooth() function allows you to draw such models over a scatterplot of the data itself. This technique is known as visualizing the model in the data space. The method argument to geom_smooth() specifies what class of smooth model to draw; since we are exploring linear models, set it to "lm".
ggplot(data = bdims, aes(x = wgt, y = hgt)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
The lm() function creates an object of class lm. There are functions to extract everything about the model, including fitted.values() and residuals(). The broom package can convert the lm object into a tidy data frame with the augment() function.
library(broom)
mod <- lm(hgt ~ wgt, data = bdims)
coef(mod)
## (Intercept) wgt
## 136.182 0.506
summary(mod)
##
## Call:
## lm(formula = hgt ~ wgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.716 -3.878 0.008 4.653 18.688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 136.1819 1.5391 88.5 <0.0000000000000002 ***
## wgt 0.5056 0.0219 23.1 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.56 on 505 degrees of freedom
## Multiple R-squared: 0.515, Adjusted R-squared: 0.514
## F-statistic: 535 on 1 and 505 DF, p-value: <0.0000000000000002
mod_tidy <- augment(mod)
Make predictions by pairing the lm model with a new data frame. These types of predictions are called out-of-sample.
mod <- lm(wgt ~ hgt, data = bdims)
ben <- data.frame(wgt = c(74.8), hgt = c(182.8))
predict(mod, newdata = ben)
## 1
## 81
The geom_smooth() function makes it easy to add a simple linear regression line to a scatterplot of the corresponding variables. And in fact, there are more complicated regression models that can be visualized in the data space with geom_smooth(). However, there may still be times when we will want to add regression lines to our scatterplot manually. To do this, we will use the geom_abline() function, which takes slope and intercept arguments.
mod <- lm(wgt ~ hgt, data = bdims)
coefs <- coef(mod)
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_abline(aes(intercept = coefs["(Intercept)"], slope = coefs["hgt"]),
color = "dodgerblue")
The root mean squared error (RMSE), a.k.a. residual standard error, is defined as the square root of the sum of squared residuals divided by the degrees of freedom. It roughly measures how far the predicted values are from the observed values, expressed in the same units as the observed value.
\(R^2 = 1 - SSE/SST\), where SSE is the sum of squared errors of the model and SST is the sum of squared errors of the null model y ~ 1 (whose prediction is \(\bar y\)). Equivalently, \(R^2 = 1 - Var(resid) / Var(y)\).
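Both quantities can be checked against the fitted object; a minimal sketch using the wgt ~ hgt model (mod) from above:
# RMSE: square root of the sum of squared residuals over the residual
# degrees of freedom
sqrt(sum(residuals(mod)^2) / df.residual(mod))
summary(mod)$sigma # the same value, reported as "Residual standard error"
# R^2 via the variance formulation
1 - var(residuals(mod)) / var(bdims$wgt)
summary(mod)$r.squared # the same value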
Leverage is defined as an observation's distance from \(\bar{x}\). Cook's distance combines the residual with the leverage score.
mod %>%
augment() %>%
arrange(desc(.hat)) %>%
head()
## wgt hgt .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## 1 85.5 198 96.6 1.26 -11.08 0.0182 9.30 0.013373 -1.201
## 2 90.9 197 95.6 1.21 -4.66 0.0170 9.31 0.002208 -0.505
## 3 49.8 147 44.8 1.13 5.02 0.0148 9.31 0.002212 0.543
## 4 80.7 194 91.9 1.07 -11.20 0.0131 9.30 0.009758 -1.211
## 5 95.9 193 91.4 1.05 4.51 0.0126 9.32 0.001523 0.488
## 6 44.8 150 47.1 1.04 -2.32 0.0124 9.32 0.000397 -0.251
library(ggplot2)
library(dplyr)
options(scipen = 999, digits = 2) # sig digits
n = 1:20
density = dbinom(x = 3, size = 1:20, 0.3)
data.frame(n, density) %>%
ggplot(aes(x = n, y = density)) +
geom_col() +
geom_text(
aes(label = round(density,2), y = density + 0.01),
position = position_dodge(0.9),
size = 3,
vjust = 0
) +
labs(title = "P(X = 3) in n Bernoulli trials where p = 0.3",
subtitle = "The distribution of at bats for a .300 hitter to get 3 hits peaks at 10.",
x = "trial number (n)",
y = "Density")
To conduct a hypothesis test, draw the sampling distribution and shade the p-value. Calculate the test statistic and reject \(H_0\) if the p-value is less than significance level \(\alpha\).
A sample of n = 2,500 students finds y = 1,200 binge drink, a proportion of p = 0.48. With 95% confidence, is this greater than the \(\pi = 0.44\) national average? \(H_a\) is \(\pi > 0.44\), so \(H_0\) is \(\pi \le 0.44\). Reject \(H_0\) if \(z = (p - \pi_0)/SE_0\) is greater than \(z_{.05/2} = 1.96\).
library(ggplot2)
n = 2500
y = 1200
p = y / n
pi = 0.44
se = sqrt(p * (1 - p) / n)
se_0 = sqrt(pi * (1 - pi) / n)
rr = qnorm(p = 0.975, mean = pi, sd = se_0) # rejection region under the null
z = (p - pi) / se_0
z_crit = qnorm(p = 0.975, mean = 0, sd = 1)
dat <- data.frame(prob = rnorm(n = 1000, mean = pi, sd = se_0))
ci_lo <- qnorm(.025, mean = p, sd = se) # same as p - z_crit * SE
ci_hi <- qnorm(.975, mean = p, sd = se) # same as p + z_crit * SE
cat("95% confidence interval is (", ci_lo, ",", ci_hi, ")")
## 95% confidence interval is ( 0.46 , 0.5 )
ggplot(dat, aes(x = prob)) +
geom_density() +
geom_vline(aes(xintercept = p), color = "red") +
geom_rect(
aes(xmin = rr, xmax = +Inf, ymin = -Inf, ymax = +Inf),
inherit.aes = FALSE, fill = "red", alpha = 0.05) +
labs(title = "Null Normal Sampling Probability Distribution",
subtitle = "Rejection region and sample statistic shown in red",
x = "Proportion",
y = "Density")
The one sample proportion comparison test compares measured proportion \(p\) to the hypothesized population proportion \(\pi_0\) with null hypothesis \(H_0:\pi=\pi_0\).
Use the exact binomial probability comparison test for small samples.
What is the probability of calculating \(p \ge .95\) from a sample of \(n = 200\) when \(\pi = .90\)?
sum(dbinom(x = 190:200, size = 200, prob = .90))
## [1] 0.0081
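Base R's binom.test() computes the same upper-tail probability directly; a sketch:
# Exact one-sided test: P(Y >= 190) when pi = 0.90
binom.test(x = 190, n = 200, p = 0.90, alternative = "greater")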
library(ggplot2)
n = 200
y = 190
p = y / n
pi = 0.90
rr = qbinom(p = 0.975, size = n, prob = pi)
dat <- data.frame(mydist = rbinom(n = 1000, size = n, prob = pi))
ci_lo <- qbinom(.025, size = n, prob = p)
ci_hi <- qbinom(.975, size = n, prob = p)
cat("95% confidence interval is (", ci_lo, ",", ci_hi, ")")
## 95% confidence interval is ( 184 , 196 )
ggplot(dat, aes(x = mydist)) +
geom_density() +
geom_vline(aes(xintercept = y), color = "red") +
geom_rect(
aes(xmin = rr, xmax = +Inf, ymin = -Inf, ymax = +Inf),
inherit.aes = FALSE, fill = "red", alpha = 0.05) +
labs(title = "Null Binomial Sampling Probability Distribution",
subtitle = "Rejection region and sample statistic shown in red",
x = "Count",
y = "Density")
Use a Wald confidence interval when the binomial distribution is approximately normal.
A maintenance crew resolves y = 33 of n = 50 repair requests within 24 hours, a proportion of p = 0.66. With 95% confidence, what proportion of repair requests does the maintenance crew resolve within 24 hours?
n = 50
y = 33
p = y / n
se = sqrt(p * (1 - p) / n)
ci_lo <- qnorm(.025, mean = p, sd = se) # same as p - z_crit * SE
ci_hi <- qnorm(.975, mean = p, sd = se) # same as p + z_crit * SE
cat("95% confidence interval is (", ci_lo, ",", ci_hi, ")")
## 95% confidence interval is ( 0.53 , 0.79 )
If the normality condition does not hold, use the Wilson-Agresti-Coull (WAC) confidence interval instead.
A maintenance crew resolves y = 43 of n = 50 repair requests within 24 hours, a proportion of p = 0.86. With 95% confidence, what proportion of repair requests does the maintenance crew resolve within 24 hours?
n = 50
y = 43
z_crit = qnorm(p = 0.975, mean = 0, sd = 1)
p = (y + 0.5 * z_crit^2) / (n + z_crit^2)
se = sqrt(p * (1 - p) / n) # note: the textbook Agresti-Coull SE divides by n + z_crit^2
me = z_crit * se
ci_lo <- qnorm(.025, mean = p, sd = se) # same as p - z_crit * SE
ci_hi <- qnorm(.975, mean = p, sd = se) # same as p + z_crit * SE
cat("95% confidence interval is", p, "+/-", me, ", (", ci_lo, ",", ci_hi, ").")
## 95% confidence interval is 0.83 +/- 0.1 , ( 0.73 , 0.94 ).
If p ~ 0 or p ~ 1, the normal approximation fails entirely; instead, set the extreme end of the confidence interval to \((\alpha/2)^{1/n}\).
A maintenance crew resolves y = 50 of n = 50 repair requests within 24 hours, a proportion of p = 1.00. With 95% confidence, what proportion of repair requests does the maintenance crew resolve within 24 hours?
n = 50
y = 50
z_crit = qnorm(p = 0.975, mean = 0, sd = 1)
ci_lo = (.05 / 2)^(1/n)
ci_hi = 1
cat("95% confidence interval is (", ci_lo, ",", ci_hi, ").")
## 95% confidence interval is ( 0.93 , 1 ).
The two-sample proportion test compares two sample proportions.
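A minimal sketch with base R's prop.test(), reusing the two maintenance-crew samples above as purely illustrative groups:
# H0: pi_1 = pi_2; two-sided by default, with continuity correction.
# The counts simply reuse the 33/50 and 43/50 examples above as fake groups.
prop.test(x = c(33, 43), n = c(50, 50))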