Load libraries with either library() or require().

R Basics

String construction: paste() concatenates strings and nchar() counts characters.

dog <- "Chester"
print(paste("you are a dog", dog))
## [1] "you are a dog Chester"
nchar(dog)
## [1] 7

Vectors

Create a vector with the combine function c(). Reference vector elements with brackets, or with element names. R compares vectors element-wise. If you compare a vector to a single value, R recycles the value to a vector of the appropriate length.
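
For example, a quick sketch of element-wise comparison and recycling of a single value:

c(1, 2, 3) > c(0, 2, 4)   # TRUE FALSE FALSE
c(1, 2, 3) > 2            # FALSE FALSE TRUE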

There are two types of vectors in R: atomic vectors and lists. Atomic vectors are homogeneous and of one of six types: logical, integer, double, character, complex, and raw (don’t worry about the relatively uncommon complex and raw types). Lists are recursive vectors (they can contain other lists).

Vectors have two key properties: type (typeof()) and length (length()). Subset a list with single brackets and extract elements with double brackets. For example,

a <- list(
  a = 1:3,
  b = "a string",
  c = pi,
  d = list(-1, -5)
)
# Single brackets return a list containing element d.
typeof(a[4])
## [1] "list"
# Double brackets extract element d itself, which is a list.
typeof(a[[4]])
## [1] "list"
# Single brackets on d return a list holding its first element.
typeof(a[[4]][1])
## [1] "list"
# Double brackets extract the first value of d.
typeof(a[[4]][[1]])
## [1] "double"
numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")
boolean_vector <- c(TRUE, FALSE, TRUE)
character_vector[1]
## [1] "a"
boolean_vector[c(2,3)]
## [1] FALSE  TRUE
boolean_vector[2:3]
## [1] FALSE  TRUE
roulette_vector <- c(-24, -50, 100, -350, 10)
names(roulette_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
roulette_vector[1]
## Monday 
##    -24
roulette_vector["Monday"]
## Monday 
##    -24
# vector operations
sum(roulette_vector)
## [1] -314
mean(roulette_vector)
## [1] -62.8
# take a subset of a vector using booleans
roulette_vector[roulette_vector>0]
## Wednesday    Friday 
##       100        10

Matrix

A matrix is a two-dimensional collection of elements. Create a matrix with the matrix(data, nrow, ncol, byrow) function. Label the rows with rownames() and the columns with colnames(). Sum each row and column into vectors with rowSums() and colSums(). Bind rows and columns to a matrix with rbind() and cbind(). Reference matrix items with brackets [row, col].

# Matrix of numbers 1:20, filling one row at a time, for 5 rows and 4 columns.  Specifying the number of columns is optional if number of rows is specified.
m <- matrix(1:20, byrow = TRUE, nrow = 5, ncol = 4)
rownames(m) <- c("row 1", "row 2", "row 3", "row 4", "row 5")
colnames(m) <- c("Col 1", "col 2", "col 3", "col 4")
m
##       Col 1 col 2 col 3 col 4
## row 1     1     2     3     4
## row 2     5     6     7     8
## row 3     9    10    11    12
## row 4    13    14    15    16
## row 5    17    18    19    20
# Bind row sums to matrix.
m.rowSum <- rowSums(m)
cbind(m, m.rowSum)
##       Col 1 col 2 col 3 col 4 m.rowSum
## row 1     1     2     3     4       10
## row 2     5     6     7     8       26
## row 3     9    10    11    12       42
## row 4    13    14    15    16       58
## row 5    17    18    19    20       74
# All rows of the second colum of m.
m[,2]
## row 1 row 2 row 3 row 4 row 5 
##     2     6    10    14    18

Use nrow() and ncol() to determine the number of rows and columns.

for (i in 1:nrow(m)) {
  for (j in 1:ncol(m)) {
    print(paste("On row ", i, " and column ", j, " the matrix contains ", m[i,j]))
  }
}
## [1] "On row  1  and column  1  the matrix contains  1"
## [1] "On row  1  and column  2  the matrix contains  2"
## [1] "On row  1  and column  3  the matrix contains  3"
## [1] "On row  1  and column  4  the matrix contains  4"
## [1] "On row  2  and column  1  the matrix contains  5"
## [1] "On row  2  and column  2  the matrix contains  6"
## [1] "On row  2  and column  3  the matrix contains  7"
## [1] "On row  2  and column  4  the matrix contains  8"
## [1] "On row  3  and column  1  the matrix contains  9"
## [1] "On row  3  and column  2  the matrix contains  10"
## [1] "On row  3  and column  3  the matrix contains  11"
## [1] "On row  3  and column  4  the matrix contains  12"
## [1] "On row  4  and column  1  the matrix contains  13"
## [1] "On row  4  and column  2  the matrix contains  14"
## [1] "On row  4  and column  3  the matrix contains  15"
## [1] "On row  4  and column  4  the matrix contains  16"
## [1] "On row  5  and column  1  the matrix contains  17"
## [1] "On row  5  and column  2  the matrix contains  18"
## [1] "On row  5  and column  3  the matrix contains  19"
## [1] "On row  5  and column  4  the matrix contains  20"

Factors

The factor() function converts a variable into type factor. R needs to know whether a variable is continuous or categorical. To specify an ordinal categorical variable, set ordered = TRUE (the code below uses order = TRUE, which partially matches the ordered argument) and supply the levels in order.

student_status <- c("student", "not student", "student", "not student")
categorical_student <- factor(student_status)
categorical_student
## [1] student     not student student     not student
## Levels: not student student
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c("Low", "Medium", "High"))
temperature_vector
## [1] "High"   "Low"    "High"   "Low"    "Medium"
# Nominal (character) comparison is alphabetical; ordered factors compare by level order.
temperature_vector[1] > temperature_vector[2]
## [1] FALSE
factor_temperature_vector[1] > factor_temperature_vector[2]
## [1] TRUE
# Change the level names with the levels function.  Note the levels are initially in alphabetical order.
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# Notice how summary() treats a factor variable differently from a character variable.
summary(survey_vector)
##    Length     Class      Mode 
##         5 character character
summary(factor_survey_vector)
## Female   Male 
##      2      3

Data Frames

A data frame is like a matrix, except each column can be a different data type. Several functions inspect data frames:

  • head() (tail()): by default prints the first (last) 6 rows of the data frame.
  • str(): prints the structure of the data frame. Probably the first function you’ll call with a new data set.
  • dim(): prints the dimensions of the data frame.
  • colnames(): prints the column names of the data frame.
  • na.omit(): removes rows with NA in any column (see the sketch after the examples below).

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
head(mtcars,6)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
colnames(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Create a data frame with the data.frame() function.

planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets_df <- data.frame(planets, type, diameter, rotation, rings)
# Select first 5 values of diameter column.  The $ is a short-cut method.
planets_df[1:5,"diameter"]
## [1]  0.382  0.949  1.000  0.532 11.209
planets_df$diameter[1:5]
## [1]  0.382  0.949  1.000  0.532 11.209

Use subset() to filter data frame rows (a SQL-style where condition). Use order() to sort the data frame (a SQL-style order by).

subset(planets_df, subset = diameter < 1)
##   planets               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
planets_df[order(planets_df$diameter),]
##   planets               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 8 Neptune          Gas giant    3.883     0.67  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE

Lists

Construct a list of objects with list(). Name the list items either with “=” at creation, or using names().

my_vector <- 1:10 
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]

my_list <- list(my_vector, my_matrix, my_df)
names(my_list) <- c("vec", "mat", "df")
my_list
## $vec
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## $df
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# or
my_list2 <- list(vec = my_vector, mat = my_matrix, df = my_df)
my_list2
## $vec
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## $df
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Reference items in a list by component number in double brackets, by name in double brackets, or by name after a dollar sign.

my_vector <- 1:10 
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)

# Third col of second element of my_list (my_matrix)
my_list[[2]][,3]
## [1] 7 8 9
my_list$mat[,3]
## [1] 7 8 9

Append to a list with the combine function c(). Wrap the appended object in list() so it is added as a single component rather than spliced in element by element.

my_vector <- 1:10 
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
my_list <- c(my_list, list(df2 = my_df))

Intermediate R

Conditionals

Relational operators are ==, !=, <, >, <=, and >=. Logical operators are &, |, and !. Be careful not to use && or || on vectors - they evaluate only the first element! The control construct is if() / else if() / else.

x <- 3
if (x %% 2 == 0) {
  print("x is divisible by 2")
} else if (x %% 3 == 0) {
  print("x is divisible by 3")
} else {
  print("x is divisible by neither 2 nor 3")
}
## [1] "x is divisible by 3"

Loops

The while loop is while (condition) { expr }. Break out of the loop early with break.

i <- 1
while (i <= 10) {
  print(3 * i)
  if (3 * i %% 8 == 0) {
    break()
  }
  i <- i + 1
}
## [1] 3
## [1] 6
## [1] 9
## [1] 12
## [1] 15
## [1] 18
## [1] 21
## [1] 24

The for loop is for (var in seq) { expr }. The break statement abandons the active loop. The next statement skips the rest of the statements in the current loop iteration.

linkedin <- c(16, 9, 13, 5, 2, 17, 14)

# Loop version 1
for(views in linkedin) {
  print(views)
  if (views > 10) {
    break
  } else if (views < 5) {
    next
  }
}
## [1] 16
# Loop version 2
for(i in 1:length(linkedin)) {
  print(linkedin[i])
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14
# seq_along handles zero-length vectors and lists.
for (i in seq_along(linkedin)) {
  print(linkedin[i])
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14

Functions

Get help on a function with help(), ?, or args(). Specify function arguments either by name or by position. Arguments with default values in the documentation are optional.

#help(mean)
#?mean
args(mean)
## function (x, ...) 
## NULL
grades <- c(8.5, 7, 9, 5.5, 6)
mean(x=grades)
## [1] 7.2
mean(grades)
## [1] 7.2

Define a custom function with function(). The return() statement returns a value and exits immediately; it is optional because a function returns its last evaluated expression. Set default argument values with =.

multiply_a_b <- function(a, b = 1) {
  return (a * b)
}
result <- multiply_a_b(a = 3, b = 7)

Install a package with install.packages(). Packages are hosted on the Comprehensive R Archive Network (CRAN). List the attached packages with search(). R attaches seven packages to its search list by default. Attach more packages with library() or require().
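
A minimal sketch (the install.packages() call is commented out because it only needs to run once):

# install.packages("purrr")
library(purrr)
search()   # lists the packages and environments currently attached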

The Apply Family

Function lapply(X, FUN, ...) applies a function to each element of a list or vector. lapply() always returns a list, so if you want a vector back, flatten the result with unlist(). If the function requires additional arguments, pass them as extra arguments to lapply(). Functions can be named or anonymous; if a function is used only once, define it within the lapply() call.

lapply(list(1,2,3), function(x) { 3 * x })
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 9

Function sapply() calls lapply() then converts the list to a one-dimensional array (vector) or two-dimensional array (matrix). If sapply cannot simplify because the resulting list contains vectors of varying lengths, then sapply() returns the same result as lapply().

Function vapply() works like sapply() but requires a FUN.VALUE template that specifies the type and length of each result. vapply() is a safer alternative to sapply().
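
A quick sketch of the difference, reusing the tripling function from above:

sapply(list(1, 2, 3), function(x) { 3 * x })   # simplifies the result to the vector 3 6 9
# vapply() needs a template and errors if any result is not a single double.
vapply(list(1, 2, 3), function(x) { 3 * x }, FUN.VALUE = numeric(1))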

purrr Package

The purrr package maps a function over a vector and returns a vector. map() returns a list; the typed variants are map_dbl(), map_lgl(), map_int(), and map_chr(). The purrr functions provide shortcuts for the .f argument, are more consistent than lapply() and sapply(), and handle iteration well.

library(purrr)
## Warning: package 'purrr' was built under R version 3.4.4
cyl <- split(mtcars, mtcars$cyl)
# Regress mpg ~ wt on each cylinder class
map(cyl, function(df) lm(mpg ~ wt, data = df))
## $`4`
## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Coefficients:
## (Intercept)           wt  
##      39.571       -5.647  
## 
## 
## $`6`
## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Coefficients:
## (Intercept)           wt  
##       28.41        -2.78  
## 
## 
## $`8`
## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Coefficients:
## (Intercept)           wt  
##      23.868       -2.192
# Same thing with shortcuts
models <- map(cyl, ~ lm(mpg ~ wt, data = .))
coefs <- map(models, coef)
map(coefs, "wt")
## $`4`
## [1] -5.647025
## 
## $`6`
## [1] -2.780106
## 
## $`8`
## [1] -2.192438
# Or, using a single command with pipes.
mtcars %>% 
  split(mtcars$cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(coef) %>% 
  map_dbl("wt")
##         4         6         8 
## -5.647025 -2.780106 -2.192438

The safely() function wraps a function so that each call returns a list with two elements: result and error. possibly() returns a default value on errors. quietly() captures printed output, messages, and warnings instead of capturing errors.

safe_readLines <- safely(readLines)

# Call safe_readLines() on "http://example.org".
# example_lines$result holds the lines of the page; example_lines$error is NULL.
example_lines <- safe_readLines("http://example.org")

# Call safe_readLines() on "http://asdfasdasdkfjlda".
# nonsense_lines$result is NULL; nonsense_lines$error holds the error object.
nonsense_lines <- safe_readLines("http://asdfasdasdkfjlda")
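
possibly() is similar; a minimal sketch that substitutes a default value instead of capturing the error:

p_readLines <- possibly(readLines, otherwise = character(0))
p_readLines("http://asdfasdasdkfjlda")   # returns character(0) instead of throwing an error
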
n <- list(5, 10, 20)
mu <- list(1, 5, 10)
sd <- list(0.1, 1, 0.1)

# iterate over the lists
pmap(list(n, mu, sd), rnorm)
## [[1]]
## [1] 1.0380868 0.9605489 1.0786154 1.0073599 1.0234126
## 
## [[2]]
##  [1] 4.343431 6.307386 3.939620 3.125216 7.622740 5.457172 5.548574
##  [8] 4.371869 4.627905 5.260454
## 
## [[3]]
##  [1] 10.053020 10.053259 10.119406  9.824395  9.995872  9.749677  9.997900
##  [8] 10.128129 10.115909 10.197187 10.031033 10.080599  9.935449 10.055783
## [15] 10.083899  9.935934  9.781156 10.215975 10.060304 10.016733
funs <- list("rnorm", "runif", "rexp")

rnorm_params <- list(mean = 10)
runif_params <- list(min = 0, max = 5)
rexp_params <- list(rate = 5)
params <- list(
  rnorm_params,
  runif_params,
  rexp_params
)

# Call invoke_map() on funs supplying params and setting n to 5
invoke_map(funs, params, n = 5)
## [[1]]
## [1]  9.657600 12.019679 10.136912 11.521788  9.658688
## 
## [[2]]
## [1] 1.0613833 2.0008371 1.4973380 2.9227932 0.3804437
## 
## [[3]]
## [1] 0.07188987 0.07739475 0.03476835 0.33302093 0.17282787

walk() operates just like map() except it’s designed for functions that don’t return anything. Use walk() for functions with side effects like printing, plotting or saving.

#?walk2
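
A minimal sketch of walk(), which calls the function for its side effect and returns its input invisibly:

walk(c("first", "second"), ~ cat("printing:", .x, "\n"))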

stopifnot() is a quick way to stop a function if a condition fails. stopifnot() takes logical expressions as arguments and throws an error if any of them is FALSE.

x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)

both_na <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(is.na(x) & is.na(y))
}
#both_na(x, y)

Use stop() instead of stopifnot() to specify a more informative error message.

x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)

both_na <- function(x, y) {
  if (length(x) != length(y)) {
    stop("x and y must have the same length", call. = FALSE)
  }    
  sum(is.na(x) & is.na(y))
}
#both_na(x, y)

Useful Functions

R features a bunch of functions to juggle around with data structures:

  • seq(from = 1, to = 2, by = 0.25): generates a sequence from 1 to 2 incremented by 0.25.
  • rep(x, times): replicates elements of vectors and lists.
  • sort(x): sorts a vector.
  • rev(x): reverses the elements in a data structure for which reversal is defined.
  • str(x): displays the structure of any R object x.
  • append(x, y): appends vector or list y to x.
  • is.*(): checks the class of an R object.
  • as.*(): casts an R object to another class.
  • unlist(x): flattens a (possibly nested) list to produce a vector.

myseq <- seq(8, 2, by=-2)
myseq
## [1] 8 6 4 2
myrep <- rep(myseq, times =2)
myrep
## [1] 8 6 4 2 8 6 4 2
myrep <- rep(myseq, each = 2)
myrep
## [1] 8 8 6 6 4 4 2 2
linkedin <- list(16, 9, 13, 5, 2, 17, 14)
facebook <- list(17, 7, 5, 16, 8, 13, 14)
li_vec <- unlist(linkedin)
fb_vec <- unlist(facebook)
social_vec <- append(li_vec, fb_vec)
sort(social_vec, decreasing = TRUE)
##  [1] 17 17 16 16 14 14 13 13  9  8  7  5  5  2
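
rev(), str(), and the is.*()/as.*() families are not shown above; a quick sketch:

rev(myseq)             # 2 4 6 8
str(linkedin)          # a list of 7 numeric elements
is.numeric(li_vec)     # TRUE
as.character(myseq)    # "8" "6" "4" "2"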

R's regular expression functions include grepl(), grep(), sub(), and gsub(). grepl(pattern = "a", x = animals) returns TRUE for each element of x matching the pattern. In a pattern, "^a" matches elements starting with "a"; "a$" matches elements ending with "a"; ".*" matches any character zero or more times; "\\s" matches a space; "[0-9]+" matches one or more digits. grep(pattern = "a", x = animals) returns the vector indices of the elements of x matching the pattern. sub(pattern = "a", replacement = "o", x = animals) substitutes the first "a" in each element with "o". gsub(pattern = "a", replacement = "o", x = animals) substitutes all "a"s with "o"s.

animals <- c("cat", "moose", "impala", "ant", "kiwi")
grepl(pattern = "a", x = animals)
## [1]  TRUE FALSE  TRUE  TRUE FALSE
which(grepl(pattern = "a", x = animals))
## [1] 1 3 4
grep(pattern = "a", x = animals)
## [1] 1 3 4
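
sub() and gsub() are not demonstrated above; a quick sketch on the same animals vector:

sub(pattern = "a", replacement = "o", x = animals)    # replaces only the first "a" in each element
gsub(pattern = "a", replacement = "o", x = animals)   # replaces every "a"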

There are two datetime classes in R: POSIXlt, a list with named components, and POSIXct, the number of seconds since 1970-01-01 00:00:00. POSIXct is more amenable to data frames, so you will encounter it much more often. Sys.Date() returns a Date object for today. Sys.time() returns the current time as POSIXct.

as.Date("2018-10-16")
## [1] "2018-10-16"
as.POSIXct("2018-11-28 08:34:00")
## [1] "2018-11-28 08:34:00 EST"
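
A quick sketch of the difference between the two classes, using unclass() to expose the internal representation:

# POSIXct is a single number of seconds since 1970-01-01 00:00:00 UTC.
unclass(as.POSIXct("2018-11-28 08:34:00", tz = "UTC"))
# POSIXlt is a list with named components such as sec, min, hour, mday, mon, and year.
unclass(as.POSIXlt("2018-11-28 08:34:00"))$hour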

Importing Data

RData

The simplest file type to import is RData: load() restores the objects saved in the file.

url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"
download.file(url_rdata, "Programs/Data/wine_local.RData")
# loading wine_local.RData creates variable wine.
load("Programs/Data/wine_local.RData")
summary(wine)
##     Alcohol        Malic acid        Ash        Alcalinity of ash
##  Min.   :11.03   Min.   :0.74   Min.   :1.360   Min.   :10.60    
##  1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210   1st Qu.:17.20    
##  Median :13.05   Median :1.87   Median :2.360   Median :19.50    
##  Mean   :12.99   Mean   :2.34   Mean   :2.366   Mean   :19.52    
##  3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560   3rd Qu.:21.50    
##  Max.   :14.83   Max.   :5.80   Max.   :3.230   Max.   :30.00    
##    Magnesium      Total phenols     Flavanoids    Nonflavanoid phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
##  1st Qu.: 88.00   1st Qu.:1.740   1st Qu.:1.200   1st Qu.:0.2700      
##  Median : 98.00   Median :2.350   Median :2.130   Median :0.3400      
##  Mean   : 99.59   Mean   :2.292   Mean   :2.023   Mean   :0.3623      
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.860   3rd Qu.:0.4400      
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
##  Proanthocyanins Color intensity       Hue           Proline      
##  Min.   :0.410   Min.   : 1.280   Min.   :1.270   Min.   : 278.0  
##  1st Qu.:1.250   1st Qu.: 3.210   1st Qu.:1.930   1st Qu.: 500.0  
##  Median :1.550   Median : 4.680   Median :2.780   Median : 672.0  
##  Mean   :1.587   Mean   : 5.055   Mean   :2.604   Mean   : 745.1  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:3.170   3rd Qu.: 985.0  
##  Max.   :3.580   Max.   :13.000   Max.   :4.000   Max.   :1680.0
# or, equivalently,
load(url("https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"))
summary(wine)
##     Alcohol        Malic acid        Ash        Alcalinity of ash
##  Min.   :11.03   Min.   :0.74   Min.   :1.360   Min.   :10.60    
##  1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210   1st Qu.:17.20    
##  Median :13.05   Median :1.87   Median :2.360   Median :19.50    
##  Mean   :12.99   Mean   :2.34   Mean   :2.366   Mean   :19.52    
##  3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560   3rd Qu.:21.50    
##  Max.   :14.83   Max.   :5.80   Max.   :3.230   Max.   :30.00    
##    Magnesium      Total phenols     Flavanoids    Nonflavanoid phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
##  1st Qu.: 88.00   1st Qu.:1.740   1st Qu.:1.200   1st Qu.:0.2700      
##  Median : 98.00   Median :2.350   Median :2.130   Median :0.3400      
##  Mean   : 99.59   Mean   :2.292   Mean   :2.023   Mean   :0.3623      
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.860   3rd Qu.:0.4400      
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
##  Proanthocyanins Color intensity       Hue           Proline      
##  Min.   :0.410   Min.   : 1.280   Min.   :1.270   Min.   : 278.0  
##  1st Qu.:1.250   1st Qu.: 3.210   1st Qu.:1.930   1st Qu.: 500.0  
##  Median :1.550   Median : 4.680   Median :2.780   Median : 672.0  
##  Mean   :1.587   Mean   : 5.055   Mean   :2.604   Mean   : 745.1  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:3.170   3rd Qu.: 985.0  
##  Max.   :3.580   Max.   :13.000   Max.   :4.000   Max.   :1680.0

Flat files

There are three common packages designed to load flat files: utils, which comes with base R, readr, and data.table.

utils

The base R utils package includes flat file reading functions. read.table() is a generic flat file loading function. The wrapper functions read.csv() and read.delim() read comma-separated and tab-delimited files, respectively.

  • stringsAsFactors = TRUE treats string variables as categorical.
  • col.names = c() overrides, or sets, column names.
  • colClasses = c() sets data types. NULL elements in the vector drop the variable.
# Opt 1: set working dir to file location
# setwd("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Data")
# Opt 2: define a file path relative to script file.
path <- file.path("Data", "swimming_pools.csv")

swimming_pools <- read.csv(path, stringsAsFactors = FALSE)

swimming_pools <- read.table(path, 
                             sep = ",",
                             header = TRUE,
                             col.names = c("name", "address", "ph", "ph2", "open_hr","facilities", "disabl","park","lat","longit"),
                             colClasses = c("factor", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "numeric", "numeric"))

readr

readr is similar to utils, but is faster and less verbose. readr returns a “tibble” instead of a data frame. Functions read_csv() and read_tsv() are wrappers for read_delim(), similar to the construction in package utils.

  • Default col_names = TRUE sets column names to the first row of data. Set col_names = FALSE for system-generated names or set col_names = c() to set the column names to a character vector.
  • col_types = c() sets data types. NULL elements in the vector drop the variable. Use shorthand strings, where col_types = "cd_il" means "character, double, (skip), integer, logical".
  • Collector functions col_factor() and col_integer() also set column types.
library(readr)
pools <- file.path("Programs/Data", "swimming_pools.csv")
# or, if on the web,
pools.path <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
pools <- read_csv(pools.path)
## Parsed with column specification:
## cols(
##   Name = col_character(),
##   Address = col_character(),
##   Latitude = col_double(),
##   Longitude = col_double()
## )
potatoes.path <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"
potatoes <- read_delim(potatoes.path, delim = "\t")
## Parsed with column specification:
## cols(
##   area = col_integer(),
##   temp = col_integer(),
##   size = col_integer(),
##   storage = col_integer(),
##   method = col_integer(),
##   texture = col_double(),
##   flavor = col_double(),
##   moistness = col_double()
## )
machine <- file.path("Programs/Data", "machine.txt")
properties <- c("new", "old")
machine.fragment <- read_tsv(machine, skip = 6, n_max = 5, 
                              col_names = properties)
## Parsed with column specification:
## cols(
##   new = col_double(),
##   old = col_double()
## )
hotdogs <- file.path("Programs/Data", "hotdogs.txt")
hotdogs_factor <- read_tsv(hotdogs,
                           col_names = c("type", "calories", "sodium"),
                           skip = 1)
## Parsed with column specification:
## cols(
##   type = col_character(),
##   calories = col_double(),
##   sodium = col_double()
## )

data.table

The data.table package is optimized for large files. fread() is faster and more convenient than read.table.

library(data.table)
## Warning: package 'data.table' was built under R version 3.4.4
## 
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
## 
##     transpose
pools <- file.path("Programs/Data", "swimming_pools.csv")

machine <- file.path("Programs/Data", "machine.txt")
properties <- c("new", "old")
machine.fragment <- fread(machine)

Excel

There are three packages to choose from, readxl, gdata, and XLConnect. gdata only handles .xls files and will be replaced when readxl is more mature. XLConnect is designed to work with Excel through R.

readxl

readxl cannot read directly from the internet. First download the file, then import the file.

Package readxl provides excel_sheets(), which lists the available sheets, and read_excel(), which reads the file.

  • Default col_names = TRUE sets column names to the first row of data. Set col_names = FALSE for system-generated names or set col_names = c() to set the column names to a character vector.
  • col_types = c() sets data types. “blank” elements in the vector drop the variable.
  • skip skips lines. If the first line holds the column names, you will have to supply col_names manually.
library(readxl)
## Warning: package 'readxl' was built under R version 3.4.4
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
download.file(url_xls, file.path("Programs/Data", "local_latitude.xls"))
#excel_readxl <- read_excel(file.path("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Programs/Data", "local_latitude.xls"))

mini.path <- file.path("Programs/Data", "MinitabIntroData.xlsx")
excel_sheets(mini.path)
## [1] "Sheet1" "Sheet2"
sheet1 <- read_excel(mini.path, sheet = "Sheet1")
sheet2 <- read_excel(mini.path, sheet = "Sheet2")
sheet.list = list(sheet1, sheet2)

# Equivalently...
sheet.list <- lapply(excel_sheets(mini.path), 
                     read_excel, path = mini.path)

gdata

gdata requires Perl in the background. It can only read .xls files, but it can read directly from web sites.

library(gdata)
## Warning: package 'gdata' was built under R version 3.4.4
## gdata: Unable to locate valid perl interpreter
## gdata: 
## gdata: read.xls() will be unable to read Excel XLS and XLSX files
## gdata: unless the 'perl=' argument is used to specify the location
## gdata: of a valid perl intrpreter.
## gdata: 
## gdata: (To avoid display of this message in the future, please
## gdata: ensure perl is installed and available on the executable
## gdata: search path.)
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLX' (Excel 97-2004) files.
## 
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLSX' (Excel 2007+) files.
## 
## gdata: Run the function 'installXLSXsupport()'
## gdata: to automatically download and install the perl
## gdata: libaries needed to support Excel XLS and XLSX formats.
## 
## Attaching package: 'gdata'
## The following objects are masked from 'package:data.table':
## 
##     first, last
## The following object is masked from 'package:purrr':
## 
##     keep
## The following object is masked from 'package:stats':
## 
##     nobs
## The following object is masked from 'package:utils':
## 
##     object.size
## The following object is masked from 'package:base':
## 
##     startsWith
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
#read.xls(url_xls)

XLConnect

#library(XLConnect)
mini.path <- file.path("Programs/Data", "MinitabIntroData.xlsx")
#my_book <- loadWorkbook(mini.path)
#class(my_book)
#getSheets(my_book)
#readWorksheet(my_book, sheet = 2)
#all <- lapply(sheets, readWorksheet, object = my_book)
#str(all)
#createSheet(my_book, name = "year_2010")
#writeWorksheet(my_book, pop_2010, sheet = "year_2010")
#saveWorkbook(my_book, file = "MinitabIntroData2.xlsx")

Other Sources

Databases

There is a dedicated package for each DBMS: RMySQL, RPostgreSQL, ROracle, etc. Function dbGetQuery() is a convenient wrapper around three functions: dbSendQuery(), dbFetch(), and dbClearResult(). Use the three functions separately if the data set is large and only a chunk of data is needed at a time, as sketched after the examples below.

library(DBI)
## Warning: package 'DBI' was built under R version 3.4.4
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "tweater", 
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                 port = 3306,
                 user = "student",
                 password = "datacamp")
con
## <MySQLConnection:0,0>
# read all tables into a list of data frames
table_names <- dbListTables(con)
tables <- lapply(table_names, dbReadTable, conn = con)
# read an entire table, then subset the rows you want (inefficient)
comments <- dbReadTable(con, "comments")
subset(comments,
       subset = user_id == 1,
       tweat_id = 77)
##      id tweat_id user_id            message
## 4  1012       87       1   awesome! thanks!
## 7  1004       49       1  this is fabulous!
## 11 1020       77       1 couldn't be better
## 12 1014       77       1       saved my day
elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments 
                        WHERE user_id = 1")
latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > \"2015-09-21\"")

dbDisconnect(con)
## [1] TRUE
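
The chunked approach with the three lower-level functions looks roughly like this (a sketch that assumes the connection has not yet been closed; the query is illustrative):

res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 2)   # fetch two rows at a time
  print(chunk)
}
dbClearResult(res)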

Internet

If a file resides on the web, reference it directly instead of downloading it manually. For the readxl package, you will have to download the file first.

url = "http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r"
dest_path = file.path("~", "local_cities.xlsx")
#download.file(url, dest_path)

The httr package also handles internet files.

library(httr)
## Warning: package 'httr' was built under R version 3.4.4
resp <- GET("http://www.example.com/")
raw_content <- content(resp, as = "raw")
head(raw_content)
## [1] 3c 21 64 6f 63 74

APIs and JSON

JSON files are either name-value pair objects, e.g. {"id":1, "name":"Frank"}, or arrays, e.g. [1, 2, 3, "dog"].

library(jsonlite)
## Warning: package 'jsonlite' was built under R version 3.4.4
## 
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
## 
##     flatten
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'
# Convert file JSON into list
wine <- fromJSON(wine_json)
str(wine)
## List of 5
##  $ name       : chr "Chateau Migraine"
##  $ year       : int 1997
##  $ alcohol_pct: num 12.4
##  $ color      : chr "red"
##  $ awarded    : logi FALSE
# Convert web API JSON into list
url_sw4 <- "http://www.omdbapi.com/?apikey=ff21610b&i=tt0076759&r=json"
url_sw3 <- "http://www.omdbapi.com/?apikey=ff21610b&i=tt0121766&r=json"

# Import two URLs with fromJSON(): sw4 and sw3
#sw4 <- fromJSON(url_sw4)
#sw3 <- fromJSON(url_sw3)

# Print the Title element of both lists
#sw4$Title
#sw3$Title

# Convert mtcars to a pretty JSON: pretty_json
pretty_json <- toJSON(mtcars, pretty = TRUE)
pretty_json
## [
##   {
##     "mpg": 21,
##     "cyl": 6,
##     "disp": 160,
##     "hp": 110,
##     "drat": 3.9,
##     "wt": 2.62,
##     "qsec": 16.46,
##     "vs": 0,
##     "am": 1,
##     "gear": 4,
##     "carb": 4,
##     "_row": "Mazda RX4"
##   },
##   {
##     "mpg": 21,
##     "cyl": 6,
##     "disp": 160,
##     "hp": 110,
##     "drat": 3.9,
##     "wt": 2.875,
##     "qsec": 17.02,
##     "vs": 0,
##     "am": 1,
##     "gear": 4,
##     "carb": 4,
##     "_row": "Mazda RX4 Wag"
##   },
##   {
##     "mpg": 22.8,
##     "cyl": 4,
##     "disp": 108,
##     "hp": 93,
##     "drat": 3.85,
##     "wt": 2.32,
##     "qsec": 18.61,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 1,
##     "_row": "Datsun 710"
##   },
##   {
##     "mpg": 21.4,
##     "cyl": 6,
##     "disp": 258,
##     "hp": 110,
##     "drat": 3.08,
##     "wt": 3.215,
##     "qsec": 19.44,
##     "vs": 1,
##     "am": 0,
##     "gear": 3,
##     "carb": 1,
##     "_row": "Hornet 4 Drive"
##   },
##   {
##     "mpg": 18.7,
##     "cyl": 8,
##     "disp": 360,
##     "hp": 175,
##     "drat": 3.15,
##     "wt": 3.44,
##     "qsec": 17.02,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 2,
##     "_row": "Hornet Sportabout"
##   },
##   {
##     "mpg": 18.1,
##     "cyl": 6,
##     "disp": 225,
##     "hp": 105,
##     "drat": 2.76,
##     "wt": 3.46,
##     "qsec": 20.22,
##     "vs": 1,
##     "am": 0,
##     "gear": 3,
##     "carb": 1,
##     "_row": "Valiant"
##   },
##   {
##     "mpg": 14.3,
##     "cyl": 8,
##     "disp": 360,
##     "hp": 245,
##     "drat": 3.21,
##     "wt": 3.57,
##     "qsec": 15.84,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Duster 360"
##   },
##   {
##     "mpg": 24.4,
##     "cyl": 4,
##     "disp": 146.7,
##     "hp": 62,
##     "drat": 3.69,
##     "wt": 3.19,
##     "qsec": 20,
##     "vs": 1,
##     "am": 0,
##     "gear": 4,
##     "carb": 2,
##     "_row": "Merc 240D"
##   },
##   {
##     "mpg": 22.8,
##     "cyl": 4,
##     "disp": 140.8,
##     "hp": 95,
##     "drat": 3.92,
##     "wt": 3.15,
##     "qsec": 22.9,
##     "vs": 1,
##     "am": 0,
##     "gear": 4,
##     "carb": 2,
##     "_row": "Merc 230"
##   },
##   {
##     "mpg": 19.2,
##     "cyl": 6,
##     "disp": 167.6,
##     "hp": 123,
##     "drat": 3.92,
##     "wt": 3.44,
##     "qsec": 18.3,
##     "vs": 1,
##     "am": 0,
##     "gear": 4,
##     "carb": 4,
##     "_row": "Merc 280"
##   },
##   {
##     "mpg": 17.8,
##     "cyl": 6,
##     "disp": 167.6,
##     "hp": 123,
##     "drat": 3.92,
##     "wt": 3.44,
##     "qsec": 18.9,
##     "vs": 1,
##     "am": 0,
##     "gear": 4,
##     "carb": 4,
##     "_row": "Merc 280C"
##   },
##   {
##     "mpg": 16.4,
##     "cyl": 8,
##     "disp": 275.8,
##     "hp": 180,
##     "drat": 3.07,
##     "wt": 4.07,
##     "qsec": 17.4,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 3,
##     "_row": "Merc 450SE"
##   },
##   {
##     "mpg": 17.3,
##     "cyl": 8,
##     "disp": 275.8,
##     "hp": 180,
##     "drat": 3.07,
##     "wt": 3.73,
##     "qsec": 17.6,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 3,
##     "_row": "Merc 450SL"
##   },
##   {
##     "mpg": 15.2,
##     "cyl": 8,
##     "disp": 275.8,
##     "hp": 180,
##     "drat": 3.07,
##     "wt": 3.78,
##     "qsec": 18,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 3,
##     "_row": "Merc 450SLC"
##   },
##   {
##     "mpg": 10.4,
##     "cyl": 8,
##     "disp": 472,
##     "hp": 205,
##     "drat": 2.93,
##     "wt": 5.25,
##     "qsec": 17.98,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Cadillac Fleetwood"
##   },
##   {
##     "mpg": 10.4,
##     "cyl": 8,
##     "disp": 460,
##     "hp": 215,
##     "drat": 3,
##     "wt": 5.424,
##     "qsec": 17.82,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Lincoln Continental"
##   },
##   {
##     "mpg": 14.7,
##     "cyl": 8,
##     "disp": 440,
##     "hp": 230,
##     "drat": 3.23,
##     "wt": 5.345,
##     "qsec": 17.42,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Chrysler Imperial"
##   },
##   {
##     "mpg": 32.4,
##     "cyl": 4,
##     "disp": 78.7,
##     "hp": 66,
##     "drat": 4.08,
##     "wt": 2.2,
##     "qsec": 19.47,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 1,
##     "_row": "Fiat 128"
##   },
##   {
##     "mpg": 30.4,
##     "cyl": 4,
##     "disp": 75.7,
##     "hp": 52,
##     "drat": 4.93,
##     "wt": 1.615,
##     "qsec": 18.52,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 2,
##     "_row": "Honda Civic"
##   },
##   {
##     "mpg": 33.9,
##     "cyl": 4,
##     "disp": 71.1,
##     "hp": 65,
##     "drat": 4.22,
##     "wt": 1.835,
##     "qsec": 19.9,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 1,
##     "_row": "Toyota Corolla"
##   },
##   {
##     "mpg": 21.5,
##     "cyl": 4,
##     "disp": 120.1,
##     "hp": 97,
##     "drat": 3.7,
##     "wt": 2.465,
##     "qsec": 20.01,
##     "vs": 1,
##     "am": 0,
##     "gear": 3,
##     "carb": 1,
##     "_row": "Toyota Corona"
##   },
##   {
##     "mpg": 15.5,
##     "cyl": 8,
##     "disp": 318,
##     "hp": 150,
##     "drat": 2.76,
##     "wt": 3.52,
##     "qsec": 16.87,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 2,
##     "_row": "Dodge Challenger"
##   },
##   {
##     "mpg": 15.2,
##     "cyl": 8,
##     "disp": 304,
##     "hp": 150,
##     "drat": 3.15,
##     "wt": 3.435,
##     "qsec": 17.3,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 2,
##     "_row": "AMC Javelin"
##   },
##   {
##     "mpg": 13.3,
##     "cyl": 8,
##     "disp": 350,
##     "hp": 245,
##     "drat": 3.73,
##     "wt": 3.84,
##     "qsec": 15.41,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Camaro Z28"
##   },
##   {
##     "mpg": 19.2,
##     "cyl": 8,
##     "disp": 400,
##     "hp": 175,
##     "drat": 3.08,
##     "wt": 3.845,
##     "qsec": 17.05,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 2,
##     "_row": "Pontiac Firebird"
##   },
##   {
##     "mpg": 27.3,
##     "cyl": 4,
##     "disp": 79,
##     "hp": 66,
##     "drat": 4.08,
##     "wt": 1.935,
##     "qsec": 18.9,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 1,
##     "_row": "Fiat X1-9"
##   },
##   {
##     "mpg": 26,
##     "cyl": 4,
##     "disp": 120.3,
##     "hp": 91,
##     "drat": 4.43,
##     "wt": 2.14,
##     "qsec": 16.7,
##     "vs": 0,
##     "am": 1,
##     "gear": 5,
##     "carb": 2,
##     "_row": "Porsche 914-2"
##   },
##   {
##     "mpg": 30.4,
##     "cyl": 4,
##     "disp": 95.1,
##     "hp": 113,
##     "drat": 3.77,
##     "wt": 1.513,
##     "qsec": 16.9,
##     "vs": 1,
##     "am": 1,
##     "gear": 5,
##     "carb": 2,
##     "_row": "Lotus Europa"
##   },
##   {
##     "mpg": 15.8,
##     "cyl": 8,
##     "disp": 351,
##     "hp": 264,
##     "drat": 4.22,
##     "wt": 3.17,
##     "qsec": 14.5,
##     "vs": 0,
##     "am": 1,
##     "gear": 5,
##     "carb": 4,
##     "_row": "Ford Pantera L"
##   },
##   {
##     "mpg": 19.7,
##     "cyl": 6,
##     "disp": 145,
##     "hp": 175,
##     "drat": 3.62,
##     "wt": 2.77,
##     "qsec": 15.5,
##     "vs": 0,
##     "am": 1,
##     "gear": 5,
##     "carb": 6,
##     "_row": "Ferrari Dino"
##   },
##   {
##     "mpg": 15,
##     "cyl": 8,
##     "disp": 301,
##     "hp": 335,
##     "drat": 3.54,
##     "wt": 3.57,
##     "qsec": 14.6,
##     "vs": 0,
##     "am": 1,
##     "gear": 5,
##     "carb": 8,
##     "_row": "Maserati Bora"
##   },
##   {
##     "mpg": 21.4,
##     "cyl": 4,
##     "disp": 121,
##     "hp": 109,
##     "drat": 4.11,
##     "wt": 2.78,
##     "qsec": 18.6,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 2,
##     "_row": "Volvo 142E"
##   }
## ]
# Minify pretty_json: mini_json
mini_json <- minify(pretty_json)
mini_json
## [{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"},{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.875,"qsec":17.02,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4 Wag"},{"mpg":22.8,"cyl":4,"disp":108,"hp":93,"drat":3.85,"wt":2.32,"qsec":18.61,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Datsun 710"},{"mpg":21.4,"cyl":6,"disp":258,"hp":110,"drat":3.08,"wt":3.215,"qsec":19.44,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Hornet 4 Drive"},{"mpg":18.7,"cyl":8,"disp":360,"hp":175,"drat":3.15,"wt":3.44,"qsec":17.02,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Hornet Sportabout"},{"mpg":18.1,"cyl":6,"disp":225,"hp":105,"drat":2.76,"wt":3.46,"qsec":20.22,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Valiant"},{"mpg":14.3,"cyl":8,"disp":360,"hp":245,"drat":3.21,"wt":3.57,"qsec":15.84,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Duster 360"},{"mpg":24.4,"cyl":4,"disp":146.7,"hp":62,"drat":3.69,"wt":3.19,"qsec":20,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 240D"},{"mpg":22.8,"cyl":4,"disp":140.8,"hp":95,"drat":3.92,"wt":3.15,"qsec":22.9,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 230"},{"mpg":19.2,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.3,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280"},{"mpg":17.8,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.9,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280C"},{"mpg":16.4,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":4.07,"qsec":17.4,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SE"},{"mpg":17.3,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.73,"qsec":17.6,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SL"},{"mpg":15.2,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.78,"qsec":18,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SLC"},{"mpg":10.4,"cyl":8,"disp":472,"hp":205,"drat":2.93,"wt":5.25,"qsec":17.98,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Cadillac Fleetwood"},{"mpg":10.4,"cyl":8,"disp":460,"hp":215,"drat":3,"wt":5.424,"qsec":17.82,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Lincoln Continental"},{"mpg":14.7,"cyl":8,"disp":440,"hp":230,"drat":3.23,"wt":5.345,"qsec":17.42,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Chrysler Imperial"},{"mpg":32.4,"cyl":4,"disp":78.7,"hp":66,"drat":4.08,"wt":2.2,"qsec":19.47,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat 128"},{"mpg":30.4,"cyl":4,"disp":75.7,"hp":52,"drat":4.93,"wt":1.615,"qsec":18.52,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Honda Civic"},{"mpg":33.9,"cyl":4,"disp":71.1,"hp":65,"drat":4.22,"wt":1.835,"qsec":19.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Toyota Corolla"},{"mpg":21.5,"cyl":4,"disp":120.1,"hp":97,"drat":3.7,"wt":2.465,"qsec":20.01,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Toyota Corona"},{"mpg":15.5,"cyl":8,"disp":318,"hp":150,"drat":2.76,"wt":3.52,"qsec":16.87,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Dodge Challenger"},{"mpg":15.2,"cyl":8,"disp":304,"hp":150,"drat":3.15,"wt":3.435,"qsec":17.3,"vs":0,"am":0,"gear":3,"carb":2,"_row":"AMC Javelin"},{"mpg":13.3,"cyl":8,"disp":350,"hp":245,"drat":3.73,"wt":3.84,"qsec":15.41,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Camaro Z28"},{"mpg":19.2,"cyl":8,"disp":400,"hp":175,"drat":3.08,"wt":3.845,"qsec":17.05,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Pontiac Firebird"},{"mpg":27.3,"cyl":4,"disp":79,"hp":66,"drat":4.08,"wt":1.935,"qsec":18.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat X1-9"},{"mpg":26,"cyl":4,"disp":120.3,"hp":91,"drat":4.43,"wt":2.14,"qsec":16.7,"vs":0,"am":1,"gear":5,"carb":2,"_row":"Porsche 
914-2"},{"mpg":30.4,"cyl":4,"disp":95.1,"hp":113,"drat":3.77,"wt":1.513,"qsec":16.9,"vs":1,"am":1,"gear":5,"carb":2,"_row":"Lotus Europa"},{"mpg":15.8,"cyl":8,"disp":351,"hp":264,"drat":4.22,"wt":3.17,"qsec":14.5,"vs":0,"am":1,"gear":5,"carb":4,"_row":"Ford Pantera L"},{"mpg":19.7,"cyl":6,"disp":145,"hp":175,"drat":3.62,"wt":2.77,"qsec":15.5,"vs":0,"am":1,"gear":5,"carb":6,"_row":"Ferrari Dino"},{"mpg":15,"cyl":8,"disp":301,"hp":335,"drat":3.54,"wt":3.57,"qsec":14.6,"vs":0,"am":1,"gear":5,"carb":8,"_row":"Maserati Bora"},{"mpg":21.4,"cyl":4,"disp":121,"hp":109,"drat":4.11,"wt":2.78,"qsec":18.6,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Volvo 142E"}]

Statistics Packages, haven and foreign

R imports SAS, Stata, and SPSS files through the haven and foreign packages.

library(haven)
## Warning: package 'haven' was built under R version 3.4.4
sales <- read_sas("http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/sales.sas7bdat")
sugar <- read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")
# Convert labeled values in Date column to dates
sugar$Date <- as.Date(as_factor(sugar$Date))
dat <- read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")
library(foreign)
# foreign can load SAS XPORT (.xpt) files but not .sas7bdat files.
# load in the data and store it in the variable cars
cars <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars.csv")
# print the first 6 rows of the dataset using the head() function
head(cars)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb               car
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         Mazda RX4
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     Mazda RX4 Wag
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1        Datsun 710
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    Hornet 4 Drive
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 Hornet Sportabout
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1           Valiant

Change the variable separator for text files with the sep argument. Use sep = "\t" for tab.

# load in the dataset
cars <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars_semicolon.csv", sep = ";")

# print the first 6 rows of the dataset
head(cars)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Get and set your working directory with getwd() and setwd(). List its files with list.files().

getwd()
## [1] "C:/Users/mpfol/OneDrive/Documents/Data Analysis"
list.files()
##  [1] "Analyzing Survey Data in R.Rmd"     
##  [2] "Analyzing_Survey_Data_in_R.html"    
##  [3] "Cookbook for R.Rmd"                 
##  [4] "Cookbook_for_R.html"                
##  [5] "Cookbook_for_R.Rmd"                 
##  [6] "Cookbook_for_R_files"               
##  [7] "Coursework"                         
##  [8] "Data"                               
##  [9] "Data Analysis.docx"                 
## [10] "Data Analysis.xlsx"                 
## [11] "Data Visualization.docx"            
## [12] "Foundations of Inference.Rmd"       
## [13] "Foundations_of_Inference.html"      
## [14] "local_latitude.xls"                 
## [15] "Programs"                           
## [16] "rmarkdown-cheatsheet.pdf"           
## [17] "rsconnect"                          
## [18] "Statistical Analysis.docx"          
## [19] "Statistical Package Syntax (1).docx"
## [20] "Statistics Notes.docx"              
## [21] "Statistics v20170301.docx"

Data Wrangling

Data Exploration

Data exploration starts with an evaluation of structure and characteristics using class() (it had better be a data.frame), dim(), and names(). Create summaries with str() or glimpse(), and summary(). Run some initial visualizations for insight into distributions. Use histograms for univariate analysis, scatterplots for numeric-numeric bivariate analysis, and boxplots for numeric-factor bivariate analysis.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:gdata':
## 
##     combine, first, last
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Check structure
class(mtcars)
## [1] "data.frame"
dim(mtcars)
## [1] 32 11
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# Initial summaries
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
glimpse(mtcars)  # Slightly cleaner version of str (requires dplyr).
## Observations: 32
## Variables: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
hist(mtcars$mpg)

plot(mtcars$mpg, mtcars$qsec)

# View sample data
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mtcars)
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

Tidying

Tidy data stores each observational unit in its own table, with one observation per row and one variable per column. Use the tidyr package to tidy messy data.

library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
wide_df <- data.frame(Obs=c(1,2),
                      a=c(1,4),
                      b=c(2,5),
                      c=c(3,6),
                      year_mo=c("2010-05","2007-07"))
wide_df
##   Obs a b c year_mo
## 1   1 1 2 3 2010-05
## 2   2 4 5 6 2007-07
# Gather wide data into key-value pairs. Exclude Obs and year_mo
long_df <- gather(wide_df, my_key, my_val, -c(Obs,year_mo))
long_df
##   Obs year_mo my_key my_val
## 1   1 2010-05      a      1
## 2   2 2007-07      a      4
## 3   1 2010-05      b      2
## 4   2 2007-07      b      5
## 5   1 2010-05      c      3
## 6   2 2007-07      c      6
# The opposite of gather() is spread()
wide_df <- spread(long_df, my_key, my_val)
wide_df
##   Obs year_mo a b c
## 1   1 2010-05 1 2 3
## 2   2 2007-07 4 5 6
# Split a column using separate().
long_df_sep <- separate(long_df, col = year_mo, into = c("year","month"), sep = "-")
long_df_sep
##   Obs year month my_key my_val
## 1   1 2010    05      a      1
## 2   2 2007    07      a      4
## 3   1 2010    05      b      2
## 4   2 2007    07      b      5
## 5   1 2010    05      c      3
## 6   2 2007    07      c      6
# The opposite of separate() is unite()
long_df_uni <- unite(long_df_sep, year_mo, year, month, sep = "-")
long_df_uni
##   Obs year_mo my_key my_val
## 1   1 2010-05      a      1
## 2   2 2007-07      a      4
## 3   1 2010-05      b      2
## 4   2 2007-07      b      5
## 5   1 2010-05      c      3
## 6   2 2007-07      c      6

Preparing for Analysis

Types of variables in R: character; numeric, including NaN and Inf; integer, denoted with an L suffix (e.g., 123L); factor; and logical, including NA.

Coerce variables into other data types with as.character(), as.numeric(), as.integer(), as.factor(), and as.logical() (where 0 coerces to FALSE). Package lubridate coerces strings to dates; valid masking characters are y, m, d, h, m, and s. Unite several fields into one with unite(). Rearrange column order with select(). Change the structure of multiple columns with mutate_at().
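
A minimal sketch of the coercion functions (the values are illustrative):

as.numeric("3.14")                  # 3.14
as.integer(3.9)                     # 3 (truncates toward zero)
as.factor(c("a", "b", "a"))         # factor with levels "a" and "b"
as.logical(0)                       # FALSE
as.logical(c("TRUE", "T", "yes"))   # TRUE TRUE NA ("yes" is not recognized)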

Because the period (.) has special meaning in certain situations, use underscores (_) to separate words in variable names. Use all lowercase letters so that no one has to remember which letters are uppercase or lowercase.

Package lubridate manipulates dates. Round dates with round_date, floor_date, and ceiling_date. All three take a unit argument specifying the resolution of rounding: “second”, “minute”, “hour”, “day”, “week”, “month”, “bimonth”, “quarter”, “halfyear”, or “year”. Or, you can specify any multiple of those units, e.g. “5 years”, “3 minutes” etc.

library(lubridate)
## Warning: package 'lubridate' was built under R version 3.4.4
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year
## The following object is masked from 'package:base':
## 
##     date
# There are 3! = 6 ymd-style parsing functions: ymd(), ydm(), mdy(), myd(), dmy(), dym().
# Create datetimes with: _h, _hm, or _hms
as.Date(ymd_hms("2005/10/23 14:40:00"))
## [1] "2005-10-23"
as.POSIXct(mdy("July 21, 2006"))
## [1] "2006-07-20 20:00:00 EDT"
ymd("2006-07-21")
## [1] "2006-07-21"
ymd("2006 Jul 21")
## [1] "2006-07-21"
mdy("July 21, 2006")
## [1] "2006-07-21"
hms("10:25:09")
## [1] "10H 25M 9S"
ymd_hms("2005/10/23 14:40:00")
## [1] "2005-10-23 14:40:00 UTC"
# If the date is in an unsupported order like dym_msh, use parse_date_time() with argument orders specifying the order of the components in the date.
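# For example (illustrative, not from the original notes):
#   parse_date_time("23 2005 Oct 14:40", orders = "dym HM")
# parses a day-year-month string with hour and minute components.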

# Combine date parts with make_date(year, month, day).
r_3_4_1 <- ymd_hms("2016-05-03 07:13:28 UTC")

# Date rounding
floor_date(r_3_4_1, unit = "day")
## [1] "2016-05-03 UTC"
round_date(r_3_4_1, unit = "5 minutes")
## [1] "2016-05-03 07:15:00 UTC"
ceiling_date(r_3_4_1, unit = "week")
## [1] "2016-05-08 UTC"

Subtract dates with the simple - operator to get a difference in days, or get finer control with the base function difftime(time1, time2, units). Get the current system datetime and date with now() and today().

date_landing <- mdy("July 20, 1969")
moment_step <- mdy_hms("July 20, 1969, 02:56:15", tz = "UTC")

difftime(today(), date_landing, units = "days")
## Time difference of 18075 days
difftime(now(), moment_step, units = "secs")
## Time difference of 1561709101 secs

Use timespans to add a fixed amount of time to dates. Distinguish periods (human calendar units) from durations (exact numbers of seconds) to handle daylight saving time gracefully. By combining addition and multiplication with sequences you can generate sequences of datetimes.

library(lubridate)
# Add a period of one week to mon_2pm
mon_2pm <- dmy_hm("27 Aug 2018 14:00")
mon_2pm + weeks(1)
## [1] "2018-09-03 14:00:00 UTC"
# Add a duration of 81 hours to tue_9am
tue_9am <- dmy_hm("28 Aug 2018 9:00")
tue_9am + dhours(81)
## [1] "2018-08-31 18:00:00 UTC"
# A period of five years is longer than a duration of 5 years!
today() - years(5)
## [1] "2014-01-14"
today() - dyears(5)
## [1] "2014-01-15"
# Create combined periods and durations.
eclipse_2017 <- ymd_hms("2017-08-21 18:26:40")
synodic <- ddays(29) + dhours(12) + dminutes(44) + dseconds(3)

# Create datetime for every two weeks for a year
today_8am <- today() + hours(8)
every_two_weeks <- 1:26 * weeks(2)
today_8am + every_two_weeks
##  [1] "2019-01-28 08:00:00 UTC" "2019-02-11 08:00:00 UTC"
##  [3] "2019-02-25 08:00:00 UTC" "2019-03-11 08:00:00 UTC"
##  [5] "2019-03-25 08:00:00 UTC" "2019-04-08 08:00:00 UTC"
##  [7] "2019-04-22 08:00:00 UTC" "2019-05-06 08:00:00 UTC"
##  [9] "2019-05-20 08:00:00 UTC" "2019-06-03 08:00:00 UTC"
## [11] "2019-06-17 08:00:00 UTC" "2019-07-01 08:00:00 UTC"
## [13] "2019-07-15 08:00:00 UTC" "2019-07-29 08:00:00 UTC"
## [15] "2019-08-12 08:00:00 UTC" "2019-08-26 08:00:00 UTC"
## [17] "2019-09-09 08:00:00 UTC" "2019-09-23 08:00:00 UTC"
## [19] "2019-10-07 08:00:00 UTC" "2019-10-21 08:00:00 UTC"
## [21] "2019-11-04 08:00:00 UTC" "2019-11-18 08:00:00 UTC"
## [23] "2019-12-02 08:00:00 UTC" "2019-12-16 08:00:00 UTC"
## [25] "2019-12-30 08:00:00 UTC" "2020-01-13 08:00:00 UTC"

ymd("2018-01-31") + months(1) returns NA. For situations like this, use alternative operators like %m+%.

library(lubridate)

# A sequence of 1 to 12 periods of 1 month
month_seq <- 1:12 * months(1)

# Add 1 to 12 months to jan_31.  This way returns NAs.
ymd("2018-01-31") + month_seq
##  [1] NA           "2018-03-31" NA           "2018-05-31" NA          
##  [6] "2018-07-31" "2018-08-31" NA           "2018-10-31" NA          
## [11] "2018-12-31" "2019-01-31"
# Better way.
ymd("2018-01-31") %m+% month_seq
##  [1] "2018-02-28" "2018-03-31" "2018-04-30" "2018-05-31" "2018-06-30"
##  [6] "2018-07-31" "2018-08-31" "2018-09-30" "2018-10-31" "2018-11-30"
## [11] "2018-12-31" "2019-01-31"

Intervals have a specific start and end time. There are two notations: datetime1 %--% datetime2, or interval(datetime1, datetime2).

# Two ways to create an interval.
dmy("5 January 1961") %--% dmy("30 January 1969")
## [1] 1961-01-05 UTC--1969-01-30 UTC
interval(dmy("5 January 1961"), dmy("30 January 1969"))
## [1] 1961-01-05 UTC--1969-01-30 UTC

Once you have an interval you can find its start, end, and length with int_start(), int_end(), and int_length() respectively. You can test whether a date falls %within% an interval, and whether two intervals overlap with int_overlaps().

my_intvl <- interval(dmy("5 January 1961"), dmy("30 January 1969"))
int_length(my_intvl)
## [1] 254620800
y2001 <- ymd("2001-01-01") %--% ymd("2001-12-31")
ymd("2001-03-30") %within% y2001
## [1] TRUE
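
A short continuation using the intervals defined above (my_intvl and y2001); a sketch, not part of the original:

int_start(my_intvl)             # start of the interval
int_end(my_intvl)               # end of the interval
int_overlaps(my_intvl, y2001)   # FALSE: 1961--1969 does not overlap 2001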

Convert an interval to a period or a duration with as.period() and as.duration().

my_intvl <- interval(dmy("5 January 1961"), dmy("30 January 1969"))
as.period(my_intvl)
## [1] "8y 0m 25d 0H 0M 0S"
as.duration(my_intvl)
## [1] "254620800s (~8.07 years)"

Extract the time zone with tz(). Change the time zone with force_tz(dt, tzone = ) or temporarily view a datetime in another zone with with_tz(dt, tzone = ). Get valid time zone names from OlsonNames().

game2 <- mdy_hm("June 11 2015 19:00")
game3 <- mdy_hm("June 15 2015 18:30")

# Set the timezone to "America/Edmonton"
game2_local <- force_tz(game2, tzone = "America/Edmonton")
game3_local <- force_tz(game3, tzone = "America/Winnipeg")

# What time is game2_local in NZ?
with_tz(game2_local, tzone = "Pacific/Auckland")
## [1] "2015-06-12 13:00:00 NZST"

stamp() is a handy way to format dates: given an example string, it returns a formatting function that you can apply to other dates.

stamp("09/20/2017")(today())
## Multiple formats matched: "%Om/%d/%y%H"(1), "%Om/%y/%d%H"(1), "%Om/%d/%Y"(1), "%m/%d/%y%H"(1), "%m/%y/%d%H"(1), "%m/%d/%Y"(1)
## Using: "%Om/%y/%d%H"
## [1] "01/19/1400"

Package stringr manipulates strings.

library(stringr)
# trim whitespace.
str_trim("  this is a test  ")
## [1] "this is a test"
# pad string with zeros.
str_pad("2493", width = 7, side = "left", pad = "0")
## [1] "0002493"
# find pattern Alice
str_detect(c("Sarah", "Alice", "Tom"), "Alice")
## [1] FALSE  TRUE FALSE
# replace pattern Alice with Jeff
str_replace(c("Sarah", "Alice", "Tom"), "Alice", "Jeff")
## [1] "Sarah" "Jeff"  "Tom"
# Change case
toupper("DataCamp")
## [1] "DATACAMP"
tolower("DataCamp")
## [1] "datacamp"

Use is.na() to locate missing (NA) values.

# 4x3 data frame with a few NAs.
df <- data.frame(A = c(1, NA, 8, NA),
                 B = c(3, NA, 88, 23), 
                 C = c(2, 45, 3, 1),
                 D = c("A", "", "C", "D"))
# Any NAs?
any(is.na(df))
## [1] TRUE
# locate the NAs.
is.na(df)
##          A     B     C     D
## [1,] FALSE FALSE FALSE FALSE
## [2,]  TRUE  TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,]  TRUE FALSE FALSE FALSE
# How many?
sum(is.na(df))
## [1] 3
# Summarize the NAs
summary(df)
##        A              B              C         D    
##  Min.   :1.00   Min.   : 3.0   Min.   : 1.00    :1  
##  1st Qu.:2.75   1st Qu.:13.0   1st Qu.: 1.75   A:1  
##  Median :4.50   Median :23.0   Median : 2.50   C:1  
##  Mean   :4.50   Mean   :38.0   Mean   :12.75   D:1  
##  3rd Qu.:6.25   3rd Qu.:55.5   3rd Qu.:13.50        
##  Max.   :8.00   Max.   :88.0   Max.   :45.00        
##  NA's   :2      NA's   :1
# Rows with no missing values, two ways
df[complete.cases(df),]
##   A  B C D
## 1 1  3 2 A
## 3 8 88 3 C
na.omit(df)
##   A  B C D
## 1 1  3 2 A
## 3 8 88 3 C
# Replace empty strings with NA
df$D[df$D == ""] <- NA

df2 <- data.frame(A = rnorm(100,50,10),
                  B = c(rnorm(99,50,10), 500),
                  C = c(rnorm(99,50,10), -1))
# Find outliers using hist() or boxplot().
hist(df2$B)

boxplot(df2)

# Drop or replace outliers.  Use which() to find index of offending observation.
mymtcars <- mtcars
ind <- which(mymtcars$mpg == 15.0)
mymtcars$mpg[ind] = 20.0

3. Data Wrangling

3.1 dplyr

The dplyr package provides data wrangling tools. dplyr introduces the tibble, a data frame subclass with a compact, screen-friendly print method; the tibble class inherits from the data frame class. Convert a data frame to a tibble with as_tibble() (older code uses tbl_df()). glimpse(tbl) works with tibbles the way str(data.frame) works with data frames. Convert a tibble back to a data frame with as.data.frame(tbl).

library(dplyr)

# hflights is a data.frame of Houston based flights.
library(hflights)
## Warning: package 'hflights' was built under R version 3.4.4
hflights <- as_tibble(hflights)
head(hflights)
## # A tibble: 6 x 21
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
## 1  2011     1          1         6    1400    1500 AA                  428
## 2  2011     1          2         7    1401    1501 AA                  428
## 3  2011     1          3         1    1352    1502 AA                  428
## 4  2011     1          4         2    1403    1513 AA                  428
## 5  2011     1          5         3    1405    1507 AA                  428
## 6  2011     1          6         4    1359    1503 AA                  428
## # ... with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## #   AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## #   Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>
summary(hflights)
##       Year          Month          DayofMonth      DayOfWeek    
##  Min.   :2011   Min.   : 1.000   Min.   : 1.00   Min.   :1.000  
##  1st Qu.:2011   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000  
##  Median :2011   Median : 7.000   Median :16.00   Median :4.000  
##  Mean   :2011   Mean   : 6.514   Mean   :15.74   Mean   :3.948  
##  3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:6.000  
##  Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :7.000  
##                                                                 
##     DepTime        ArrTime     UniqueCarrier        FlightNum   
##  Min.   :   1   Min.   :   1   Length:227496      Min.   :   1  
##  1st Qu.:1021   1st Qu.:1215   Class :character   1st Qu.: 855  
##  Median :1416   Median :1617   Mode  :character   Median :1696  
##  Mean   :1396   Mean   :1578                      Mean   :1962  
##  3rd Qu.:1801   3rd Qu.:1953                      3rd Qu.:2755  
##  Max.   :2400   Max.   :2400                      Max.   :7290  
##  NA's   :2905   NA's   :3066                                    
##    TailNum          ActualElapsedTime    AirTime         ArrDelay      
##  Length:227496      Min.   : 34.0     Min.   : 11.0   Min.   :-70.000  
##  Class :character   1st Qu.: 77.0     1st Qu.: 58.0   1st Qu.: -8.000  
##  Mode  :character   Median :128.0     Median :107.0   Median :  0.000  
##                     Mean   :129.3     Mean   :108.1   Mean   :  7.094  
##                     3rd Qu.:165.0     3rd Qu.:141.0   3rd Qu.: 11.000  
##                     Max.   :575.0     Max.   :549.0   Max.   :978.000  
##                     NA's   :3622      NA's   :3622    NA's   :3622     
##     DepDelay          Origin              Dest              Distance     
##  Min.   :-33.000   Length:227496      Length:227496      Min.   :  79.0  
##  1st Qu.: -3.000   Class :character   Class :character   1st Qu.: 376.0  
##  Median :  0.000   Mode  :character   Mode  :character   Median : 809.0  
##  Mean   :  9.445                                         Mean   : 787.8  
##  3rd Qu.:  9.000                                         3rd Qu.:1042.0  
##  Max.   :981.000                                         Max.   :3904.0  
##  NA's   :2905                                                            
##      TaxiIn           TaxiOut         Cancelled       CancellationCode  
##  Min.   :  1.000   Min.   :  1.00   Min.   :0.00000   Length:227496     
##  1st Qu.:  4.000   1st Qu.: 10.00   1st Qu.:0.00000   Class :character  
##  Median :  5.000   Median : 14.00   Median :0.00000   Mode  :character  
##  Mean   :  6.099   Mean   : 15.09   Mean   :0.01307                     
##  3rd Qu.:  7.000   3rd Qu.: 18.00   3rd Qu.:0.00000                     
##  Max.   :165.000   Max.   :163.00   Max.   :1.00000                     
##  NA's   :3066      NA's   :2947                                         
##     Diverted       
##  Min.   :0.000000  
##  1st Qu.:0.000000  
##  Median :0.000000  
##  Mean   :0.002853  
##  3rd Qu.:0.000000  
##  Max.   :1.000000  
## 
# hflights consists of 227,496 observations and 21 variables.
nrow(hflights)
## [1] 227496
ncol(hflights)
## [1] 21
# Create a lookup table for the UniqueCarrier column using a named vector.
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental", 
         "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways", 
         "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier", 
         "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
hflights$Carrier <- lut[hflights$UniqueCarrier]

dplyr features five verbs. select(.data, ...) picks variables: use : to select a range of variables and - to exclude some, similar to indexing a data.frame with square brackets; use variable names or integer indexes, or the helper functions starts_with(), ends_with(), contains(), matches(), num_range(), and one_of(). filter(.data, ...) keeps rows matching one or more comparisons built from operators such as ==, !=, and %in%, combined with & and |. arrange(.data, ...) sorts rows; wrap a column in desc() to override the default ascending order, e.g. arrange(desc(gdpPercap)). mutate(.data, ...) adds or changes columns from name-value pairs of expressions. summarise(.data, ...) aggregates; base R includes several aggregate functions, and dplyr adds first(), last(), nth(), n(), and n_distinct(). Pipe a data set into a verb with %>%. group_by(.data, col(s)) only has an effect when combined with summarise(), so specify group_by() before summarise().

dplyr uses %>% from the magrittr package.

library(dplyr)
library(hflights)
hflights <- as_tibble(hflights)
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental", 
         "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways", 
         "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier", 
         "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
hflights$Carrier <- lut[hflights$UniqueCarrier]
# select example
select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancell"))
## # A tibble: 227,496 x 5
##    UniqueCarrier FlightNum TailNum Cancelled CancellationCode
##  * <chr>             <int> <chr>       <int> <chr>           
##  1 AA                  428 N576AA          0 ""              
##  2 AA                  428 N557AA          0 ""              
##  3 AA                  428 N541AA          0 ""              
##  4 AA                  428 N403AA          0 ""              
##  5 AA                  428 N492AA          0 ""              
##  6 AA                  428 N262AA          0 ""              
##  7 AA                  428 N493AA          0 ""              
##  8 AA                  428 N477AA          0 ""              
##  9 AA                  428 N476AA          0 ""              
## 10 AA                  428 N504AA          0 ""              
## # ... with 227,486 more rows
# mutate example
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)
## Warning: `as_dictionary()` is soft-deprecated as of rlang 0.3.0.
## Please use `as_data_pronoun()` instead
## This warning is displayed once per session.
## Warning: `new_overscope()` is soft-deprecated as of rlang 0.2.0.
## Please use `new_data_mask()` instead
## This warning is displayed once per session.
## Warning: The `parent` argument of `new_data_mask()` is deprecated.
## The parent of the data mask is determined from either:
## 
##   * The `env` argument of `eval_tidy()`
##   * Quosure environments when applicable
## This warning is displayed once per session.
## Warning: `overscope_clean()` is soft-deprecated as of rlang 0.2.0.
## This warning is displayed once per session.
# filter example
hflights %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = 60 * Distance/ RealTime) %>%
  filter(!is.na(mph) & mph < 70) %>%
  group_by(UniqueCarrier) %>%
  summarize(n_less = n(), n_dest = n_distinct(Dest), min_dist = min(Distance), max_dist = max(Distance))
## # A tibble: 6 x 5
##   UniqueCarrier n_less n_dest min_dist max_dist
##   <chr>          <int>  <int>    <dbl>    <dbl>
## 1 AA                40      1     224.     224.
## 2 CO              3393      4     140.     305.
## 3 MQ                12      1     247.     247.
## 4 OO               349      3     140.     224.
## 5 WN              1747      4     148.     239.
## 6 XE              1185     12      79.     253.

dplyr works for data frames, data tables, and databases.

Use dplyr to merge data instead of base R merge() because dplyr syntax is intuitive, preserves row order, and works with databases.

The four mutating joins are left_join(tbl1, tbl2, by = c(col_names)), right_join, inner_join, and full_join.
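
A minimal sketch of the mutating joins on two toy tibbles (the names and values are made up for illustration):

library(dplyr)
band  <- data_frame(name = c("Mick", "John", "Paul"),
                    band = c("Stones", "Beatles", "Beatles"))
plays <- data_frame(name = c("John", "Paul", "Keith"),
                    plays = c("guitar", "bass", "guitar"))
left_join(band, plays, by = "name")    # all rows of band; unmatched rows get NA
inner_join(band, plays, by = "name")   # only rows with a match in both tables
full_join(band, plays, by = "name")    # all rows from both tables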

The filtering join semi_join() keeps the rows of the first table that have a match in the second table, without adding any columns from it. anti_join() keeps the rows of the first table that have no match in the second table.

The set functions are union(), intersect(), and setdiff().

setequal(set1, set2) checks whether two tables contain the same rows (not necessarily in the same order).
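
A small sketch of the set operations on toy one-column tables (values are illustrative):

library(dplyr)
df1 <- data_frame(x = 1:3)
df2 <- data_frame(x = 2:4)
union(df1, df2)             # rows appearing in either table
intersect(df1, df2)         # rows appearing in both tables
setdiff(df1, df2)           # rows in df1 but not in df2
setequal(df1, df1[3:1, ])   # TRUE: same rows, different order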

If two datasets have identical structure, combine them with bind_rows() and bind_cols(), the dplyr equivalents of base R rbind() and cbind().

dplyr improves on the base R function data.frame() with data_frame(): it will not change data types, add row or column names, or recycle vectors (other than length-1 vectors). as_data_frame() parallels the behavior of data_frame(): it combines a list of vectors into a data frame, making it the column-wise counterpart of bind_rows(), which combines data frames row-wise.
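
A brief sketch of data_frame() and the bind functions (values are illustrative):

library(dplyr)
d1 <- data_frame(x = 1:2, chr = c("a", "b"))       # strings stay character, not factors
d2 <- data_frame(x = 3:4, chr = c("c", "d"))
bind_rows(d1, d2)                                  # stack rows
as_data_frame(list(x = 1:2, chr = c("a", "b")))    # list of vectors -> data frame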

library(Lahman)
## Warning: package 'Lahman' was built under R version 3.4.4
library(dplyr)

players <- Master %>% 
  distinct(playerID, nameFirst, nameLast)

players %>%
  # Find unsalaried players
  anti_join(Salaries, by = "playerID") %>% 
  # Join Batting to the unsalaried players
  left_join(Batting, by = "playerID") %>% 
  # Group by player
  group_by(playerID) %>% 
  # Sum at-bats for each player
  summarise(total_at_bat = sum(AB, na.rm = TRUE)) %>% 
  # Arrange in descending order
  arrange(desc(total_at_bat))
## # A tibble: 13,958 x 2
##    playerID  total_at_bat
##    <chr>            <int>
##  1 aaronha01        12364
##  2 yastrca01        11988
##  3 cobbty01         11434
##  4 musiast01        10972
##  5 mayswi01         10881
##  6 robinbr01        10654
##  7 wagneho01        10430
##  8 brocklo01        10332
##  9 ansonca01        10277
## 10 aparilu01        10230
## # ... with 13,948 more rows
library(Lahman)
library(dplyr)

# Find the distinct players that appear in HallOfFame
nominated <- HallOfFame %>% 
  distinct(playerID)

nominated %>% 
  # Count the number of players in nominated
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1260
# 1,260 players were nominated for the hall of fame.

nominated_full <- nominated %>% 
  # Join to Master
  left_join(Master, by = "playerID") %>% 
  # Return playerID, nameFirst, nameLast
  select(playerID, nameFirst, nameLast)

# Find distinct players in HallOfFame with inducted == "Y"
inducted <- HallOfFame %>% 
  filter(inducted == "Y") %>% 
  distinct(playerID)

inducted %>% 
  # Count the number of players in inducted
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1   317
# 317 players have been inducted.

inducted_full <- inducted %>% 
  # Join to Master
  left_join(Master, by = "playerID") %>% 
  # Return playerID, nameFirst, nameLast
  select(playerID, nameFirst, nameLast)


# Tally the number of awards in AwardsPlayers by playerID
nAwards <- AwardsPlayers %>% 
  group_by(playerID) %>% 
  tally()

nAwards %>% 
  # Filter to just the players in inducted 
  semi_join(inducted, by = "playerID") %>% 
  # Calculate the mean number of awards per player
  summarize(avg_n = mean(n, na.rm = TRUE))
## # A tibble: 1 x 1
##   avg_n
##   <dbl>
## 1  12.1
nAwards %>% 
  # Filter to just the players in nominated 
  semi_join(nominated, by = "playerID") %>% 
  # Filter to players NOT in inducted 
  anti_join(inducted, by = "playerID") %>% 
  # Calculate the mean number of awards per player
  summarize(avg_n = mean(n, na.rm = TRUE))
## # A tibble: 1 x 1
##   avg_n
##   <dbl>
## 1  4.23
# On average, inductees had about 12.1 - 4.23 = 7.9 more awards than non-inductees. 


# Find the players who are in nominated, but not inducted
notInducted <- nominated %>% 
  setdiff(inducted)

Salaries %>% 
  # Find the players who are in notInducted
  semi_join(notInducted, by = "playerID") %>% 
  # Calculate the max salary by player
  group_by(playerID) %>%
  summarize(max_salary = max(salary, na.rm = TRUE)) %>% 
  # Calculate the average of the max salaries
  summarize(avg_salary = mean(max_salary, na.rm = TRUE)) 
## # A tibble: 1 x 1
##   avg_salary
##        <dbl>
## 1   5230273.
# Repeat for players who were inducted
Salaries %>% 
  semi_join(inducted, by = "playerID") %>% 
  group_by(playerID) %>%
  summarize(max_salary = max(salary, na.rm = TRUE)) %>% 
  summarize(avg_salary = mean(max_salary, na.rm = TRUE))
## # A tibble: 1 x 1
##   avg_salary
##        <dbl>
## 1   6092038.
Appearances %>% 
  # Filter Appearances against nominated
  semi_join(nominated, by = "playerID") %>% 
  # Find last year played by player
  group_by(playerID) %>% 
  summarize(last_year = max(yearID)) %>% 
  # Join to full HallOfFame
  left_join(HallOfFame, by = "playerID") %>% 
  # Filter for unusual observations
  filter((yearID - last_year)<5)
## # A tibble: 194 x 10
##    playerID  last_year yearID votedBy ballots needed votes inducted
##    <chr>         <dbl>  <int> <chr>     <int>  <int> <int> <fct>   
##  1 altroni01     1933.   1937 BBWAA       201    151     3 N       
##  2 applilu01     1950.   1953 BBWAA       264    198     2 N       
##  3 bartedi01     1946.   1948 BBWAA       121     91     1 N       
##  4 beckro01      2004.   2008 BBWAA       543    408     2 N       
##  5 boudrlo01     1952.   1956 BBWAA       193    145     2 N       
##  6 camildo01     1945.   1948 BBWAA       121     91     1 N       
##  7 chandsp01     1947.   1950 BBWAA       168    126     2 N       
##  8 chandsp01     1947.   1951 BBWAA       226    170     1 N       
##  9 chapmbe01     1946.   1949 BBWAA       153    115     1 N       
## 10 cissebi01     1938.   1937 BBWAA       201    151     1 N       
## # ... with 184 more rows, and 2 more variables: category <fct>,
## #   needed_note <chr>

Data Visualization

Data visualization serves two purposes: exploratory analysis (investigating the data yourself) and explanatory analysis (communicating findings to others).

There are seven grammatical layers of plots; three are required: data, aesthetics, and geometries. The other elements are facets (subplots), statistics (e.g., fitted lines), coordinates, and themes. The grammar of graphics is implemented in the ggplot2 package.

Base R provides plotting functionality, but it comes with limitations. The plot is drawn as an image, not returned as an object, so you cannot manipulate it further. It does not draw a legend automatically. There is a separate function for each plot type, and the lack of a unified framework means you have to learn each one separately: points(), hist(), etc.

Scale the x axis with a scale_x_log10() layer. There are two main reasons to use logarithmic scales in charts and graphs. The first is to respond to skewness towards large values, i.e., cases in which one or a few points are much larger than the bulk of the data. The second is to show percent change or multiplicative factors, such as exponential growth. On a log scale with base 2, the value of each tick mark is double the value of the preceding one. See the ggplot2 documentation on continuous scales for more.
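
A minimal sketch with mtcars (any positively skewed variable works):

library(ggplot2)
ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  scale_x_log10()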

Scatterplots

For scatterplots, map x, y, color, and shape in the aesthetic layer. Set size, fill, shape, alpha (transparency), and position (e.g., "jitter") as attributes in the geom_point() layer.

mtcars$cyl <- as.factor(mtcars$cyl)

# Use base r to create plots with a series for each cyl value.
# Add a linear fit line through the points, one for each series, and one overall.
plot(mtcars$wt, mtcars$mpg, col = factor(mtcars$cyl))
abline(lm(mpg ~ wt, data = mtcars), lty = 2)
# One fitted line per cyl level, colored to match the points (1, 2, 3).
invisible(lapply(seq_along(levels(mtcars$cyl)), function(i) {
  abline(lm(mpg ~ wt, mtcars, subset = (cyl == levels(mtcars$cyl)[i])), col = i)
}))
legend(x = 5, y = 33, legend = levels(mtcars$cyl),
       col = 1:3, pch = 1, bty = "n")

# Again in ggplot2
# The first geom_smooth inherits the ggplot color aesthetic as its group.
# The second geom_smooth explicitly sets group to a dummy value of 1.  The col = "All" adds it to the legend.
# When mapping onto color you can sometimes treat a continuous scale, like year, as an ordinal variable, but only if it is a regular series. The better alternative is to leave it as a continuous variable and use the group aesthetic as a factor to make sure your plot is drawn correctly. 
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl, group = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + 
  geom_smooth(method = "lm", se = FALSE, linetype = 2, aes(group = 1, col = "All"))

ggplot can visualize four attributes at once with x, y, col, and facet_grid. Such graphing requires tidy data, which in turn requires thoughtful definitions of metrics. In the iris data set, if measuring length vs width, then those are separate variables (columns). If measuring length (or width) vs species, then species is a variable. If measuring length (or width) vs part of flower (petal vs sepal), then flower part is a variable. To look at all four together, length and width become values of a single Measure variable (they share units).

library(ggplot2)
library(tidyr)

iris.tidy <- iris %>%
  # gather(data, key, value, <cols>)
  # Transpose all cols to rows except the identifier cols (Species)
  # The former call name becomes a value in the key column.
  gather(key, Value, -Species) %>%
  # separate(data, col, into, sep)
  separate(col = key, into = c("Part", "Measure"), sep = "\\.")

# If we want to plot Length vs Width, then each should be a column.
iris$Flower <- 1:nrow(iris)
iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  spread(Measure, value)
ggplot(iris.wide, aes(x = Length, y = Width, color = Part)) +
  geom_jitter() +
  facet_grid(. ~ Species)

Typical aesthetics are x, y, colour, fill, size, alpha, linetype, labels, and shape. Shapes 1:20 accept only the colour aesthetic, while shapes 21:25 accept both colour and fill.

One common technique to use with solid shapes is alpha blending (i.e. adding transparency). An alternative is to use hollow shapes.

library(ggplot2)

# Basic scatter plot: wt on x-axis and mpg on y-axis; map cyl to col
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
geom_point(size = 4)

# Hollow circles - an improvement
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
geom_point(size = 4, shape = 1)

# Add transparency - very nice
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
geom_point(size = 4, alpha = 0.6)

The default geom_smooth() method is LOESS, a non-parametric form of regression that uses a weighted, sliding-window average to calculate a line of best fit. Control the size of this window with the span argument (default 0.75). Reducing the span follows the data more closely, but risks over-fitting.
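
A small sketch of the span argument (the value 0.3 is chosen only for illustration):

library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(se = FALSE) +                           # default LOESS window
  geom_smooth(se = FALSE, span = 0.3, col = "red")    # smaller window, wigglier fit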

Another useful stat function is stat_sum(), which calculates the total number of overlapping observations at each point and is a good way to deal with overplotting.
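
For example (a sketch; cyl and gear take only a few values, so points overlap heavily):

library(ggplot2)
ggplot(mtcars, aes(x = cyl, y = gear)) +
  stat_sum()   # point size reflects the number of overlapping observations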

Histograms

The x aesthetic: geom_histogram() uses stat = "bin" to cut a continuous variable into discrete bins, and binwidth defaults to range/30, a reasonable starting point when you do not know anything about the variable and want to start exploring. The y aesthetic: geom_histogram() only requires x, yet the plot clearly has a y axis. Internally ggplot2 builds a data frame of computed statistics; the ..count.. variable from that data frame is mapped to y by default, and ..density.. is also available.

library(ggplot2)

# 1 - Make a univariate histogram
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# 2 - Plot 1, plus set binwidth to 1 in the geom layer
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 1)

# 3 - Plot 2, plus MAP ..density.. to the y aesthetic (i.e. in a second aes() function)
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = ..density..), binwidth = 1)

# 4 - plot 3, plus SET the fill attribute to "#377EB8" (outside aes(), since it is set, not mapped)
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = ..density..), fill = "#377EB8", binwidth = 1)

#ggplot(gapminder_1952, aes(x = pop)) + 
#geom_histogram() + 
#scale_x_log10()

Bar Plots

Use bar plots to compare numeric values across levels of a categorical variable. When the bar heights come from values already in the data, use the geom_col() layer; geom_bar() counts rows for you.

Like geom_point(), the geom_bar() and geom_histogram() geoms have a position argument which specifies how to draw the bars of the plot. Three common positions are stack (default) with counts, fill with proportions, and dodge with counts.

library(ggplot2)

mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)

# Draw a bar plot of cyl, filled according to am
ggplot(mtcars, aes(x = cyl, fill = am)) +
  geom_bar()

# Change the position argument to stack
ggplot(mtcars, aes(x = cyl, fill = am)) +
  geom_bar(position = "stack")

# Change the position argument to fill
ggplot(mtcars, aes(x = cyl, fill = am)) +
  geom_bar(position = "fill")

# Change the position argument to dodge
ggplot(mtcars, aes(x = cyl, fill = am)) +
  geom_bar(position = "dodge")

# Set the amount of dodging by specifying dodge as its own object.
posn_d = position_dodge(width = 0.2)
ggplot(mtcars, aes(x = cyl, fill = am)) +
  geom_bar(position = posn_d, alpha = 0.6)

by_continent <- gapminder %>%
filter(year == 1952) %>%
group_by(continent) %>%
summarize(medianGdpPercap = median(gdpPercap))

# Create a bar plot showing medianGdp by continent'
ggplot(by_continent, aes(x = continent, y = medianGdpPercap)) +
geom_col()

Set geom_point(position = ) attributes with identity (default), dodge (side-by-side bar), stack, fill (stacked bar), jitter, and jitterdodge. Set aesthetic scale functions with scale_<aesthetic>_<data_type>.

library(ggplot2)
cyl.am <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))
val = c("#E41A1C", "#377EB8")
lab = c("Manual", "Automatic")
cyl.am +
  geom_bar(position = "dodge") +
  scale_x_discrete("Cylinders") + 
  scale_y_continuous("Number") +
  scale_fill_manual("Transmission", 
                    values = val,
                    labels = lab)

Line Plots

Line plots are almost exactly like scatter plots. Use them to show change over time.

gapminder_gt1952 <- gapminder %>%
  filter(year >= 1952) %>%
  group_by(year, continent) %>%
  summarize(medianGdpPercap = median(gdpPercap), 
            sumPop = sum(as.numeric(pop)))  # as.numeric() avoids integer overflow when summing pop
ggplot(gapminder_gt1952, aes(x = year, y = medianGdpPercap, 
                           color = continent)) + 
  geom_line()

library(ggplot2)
recess <- data.frame(begin = c('1970-01-01', '1975-01-01', '1980-01-01', '1982-01-01', '1991-01-01', '2001-01-01'), 
                     end = c('1970-12-01', '1976-12-01', '1980-12-01', '1983-12-01', '1991-12-01', '2001-12-01'))
recess$begin <- as.Date(recess$begin)
recess$end <- as.Date(recess$end)
ggplot(economics, aes(x = date, y = unemploy/pop)) +
  geom_rect(data = recess,
         aes(xmin = begin, xmax = end, ymin = -Inf, ymax = +Inf),
         inherit.aes = FALSE, fill = "red", alpha = 0.2) +
  geom_line()

qplot is a quick-and-dirty variation of ggplot.

Two options for changing the coordinate limits are scale_x_continuous(limits = ) and coord_cartesian(xlim = ). The first drops data outside the limits before statistics are computed (note the warnings below); the second simply zooms in.

library(ggplot2)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)
p <- ggplot(mtcars, aes(x = wt, y = hp, col = am)) + geom_point() + geom_smooth()

# Add scale_x_continuous()
p + scale_x_continuous(limits = c(3, 6), expand = c(0, 0))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : at 3.168
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : radius 4e-006
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : all data on boundary of neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 3.168
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.002
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : at 3.572
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : radius 4e-006
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : all data on boundary of neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4e-006
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : zero-width neighborhood. make span bigger

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : zero-width neighborhood. make span bigger
## Warning: Computation failed in `stat_smooth()`:
## NA/NaN/Inf in foreign function call (arg 5)
## Warning: Removed 12 rows containing missing values (geom_point).

# Add coord_cartesian(): the proper way to zoom in
p + coord_cartesian(xlim = c(3, 6))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Set the aspect ratio of a plot with coord_fixed() or coord_equal(). Both use ratio = 1 as a default. A 1:1 aspect ratio is appropriate when two continuous variables are on the same scale, as with the iris dataset.

# Complete basic scatter plot function
base.plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
               geom_jitter() +
               geom_smooth(method = "lm", se = FALSE)

# Plot base.plot: default aspect ratio
base.plot

# Fix aspect ratio (1:1) of base.plot
base.plot + coord_equal()

Facets are another way of presenting categorical variables. The most straightforward way of using facets is facet_grid(). Here we just need to specify the categorical variable to use on rows and columns using standard R formula notation (rows ~ columns).

# Basic scatter plot
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# 1 - Separate rows according to transmission type, am
p +
  facet_grid(am ~ .)

# 2 - Separate columns according to cylinders, cyl
p +
  facet_grid(. ~ cyl)

# 3 - Separate by both columns and rows 
p +
  facet_grid(am ~ cyl)

The themes layer handles all the non-data ink attributes. To change the appearance of lines, use the element_line() function.
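
A minimal sketch of a theme tweak with element_line() (colors and line types chosen arbitrarily):

library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_line(colour = "grey80", linetype = "dashed"))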

library(ggplot2)
#library(Hmisc)

# Base layers
m <- ggplot(mtcars, aes(x = cyl, y = wt))

# Draw dynamite plot
m +
  stat_summary(fun.y = mean, geom = "bar", fill = "skyblue") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)

library(haven)
# data from https://healthpolicy.ucla.edu/chis/data/Pages/GetCHISData.aspx.
# login as mpfoley73/HealthPolicy.
chis_path <- file.path("C:/Users/mpfol/OneDrive/Documents/Data Science/Data/CHIS 2009 PUF- Adult SAS", "adult.sas7bdat")
#chis_path <- file.path("C:/Users/michael.foley/OneDrive - The Centers for Families and Children/Documents/CHIS 2009 PUF- Adult SAS", "adult.sas7bdat")
adult <- read_sas(chis_path)
dim(adult)
## [1] 47614   536
adult <- adult[c("RBMI", "BMI_P", "RACEHPR2", "SRSEX", "SRAGE_P", "MARIT2", "AB1", "ASTCUR", "AB51", "POVLL")]
dim(adult)
## [1] 47614    10
library(dplyr)
adult <- adult %>% filter(RACEHPR2 == 1 | RACEHPR2 == 4 | RACEHPR2 == 5 | RACEHPR2 == 6)
dim(adult)
## [1] 44346    10
# Investigate the relationship between BMI and age. 
# Start by looking at the distributions of the univariate data.
# The default histogram has an interesting pattern of peaks.  This is probably an artifact of the default binning statistic of bins = 30: each bin spans about 2.23 years (diff(range(adult$SRAGE_P)) / 30).
library(ggplot2)
ggplot(adult, aes(x = SRAGE_P)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# The BMI histogram shows right skew.  We might want to remove extreme values.
ggplot(adult, aes(x = BMI_P)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Typically we explore the relationship between two continuous variables with a scatterplot, but this one does not reveal any interesting trends.
ggplot(adult, aes(x = SRAGE_P, y = BMI_P)) +
  geom_point()

# It turns out BMI is also reported as an ordinal category: 0-18.49 = Under-weight, 18.5-24.99 = Healthy-weight, 25-29.99 = Over-weight, 30.0+ = Obese.  Here is a scatterplot colored by category.  There are still problems: the range of each group differs, and it is difficult to tell the size of each group.
ggplot(adult, aes(x = SRAGE_P, y = BMI_P, col = factor(RBMI))) +
  geom_point(alpha = 0.4, position =position_jitter(width = 0.5))

# Try a histogram instead.  This is good, but we cannot answer meaningful questions.
# Notice one unexpected attribute: it looks like ages >=85 are categorized as 85.
ggplot(adult, aes(x = SRAGE_P, fill = factor(RBMI))) +
  geom_histogram(binwidth = 1)

# How do the proportions of each BMI category change across age groups?  We need to plot proportions.
ggplot(adult, aes(x = SRAGE_P, fill = factor(RBMI))) +
  geom_histogram(aes(y = ..count../sum(..count..)), binwidth = 1, position = "fill")

# There is an unusual spike of individuals at 85, which seems like an artifact of data collection and storage. Solve this by keeping only observations for which adult$SRAGE_P is smaller than or equal to 84.
adult <- adult[adult$SRAGE_P <= 84, ] 

# There is a long positive tail on the BMIs that we'd like to remove. Only keep observations for which adult$BMI_P is larger than or equal to 16 and adult$BMI_P is strictly smaller than 52.
adult <- adult[adult$BMI_P >= 16 & adult$BMI_P < 52, ]

# We'll focus on the relationship between the BMI score (& category), age and race. To make plotting easier later on, we'll change the labels in the dataset. 
adult$RACEHPR2 <- factor(adult$RACEHPR2, labels = c("Latino", "Asian", "African American", "White"))
adult$RBMI <- ordered(adult$RBMI,
                      levels = c(1, 2, 3, 4),
                      labels = c("Under-weight", "Healthy-weight", "Over-weight", "Obese"))
# The color scale used in the plot
BMI_fill <- scale_fill_brewer("BMI Category", palette = "Reds")

# Theme to fix category display in faceted plot
fix_strips <- theme(strip.text.y = element_text(angle = 0, hjust = 0, vjust = 0.1, size = 14),
                    strip.background = element_blank(),
                    legend.position = "none")

# Histogram, add BMI_fill and customizations
ggplot(adult, aes (x = SRAGE_P, fill= RBMI)) + 
  geom_histogram(binwidth = 1) +
  fix_strips +
  BMI_fill +
  facet_grid(RBMI ~ .) +
  theme_classic()

# The absolute count of multiple histograms is fine, but density would be a more useful measure if we wanted to see how the frequency of one variable changes across another. Here is a frequency histogram when we have many sub-categories. The problem here is that this can't be facetted because the calculations occur on the fly inside ggplot2.
ggplot(adult, aes (x = SRAGE_P, fill= factor(RBMI))) + 
  geom_histogram(aes(y = ..count../sum(..count..)), binwidth = 1, position = "fill") +
  BMI_fill

# To overcome this we're going to calculate the proportions outside ggplot2. 
# Create DF with table()
DF <- table(adult$RBMI, adult$SRAGE_P)
# Use apply on DF to get frequency of each group
DF_freq <- apply(DF, 2, function(x) x/sum(x))
# Load reshape2 and use melt on DF to create DF_melted
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
DF_melted <- melt(DF_freq)
# Change names of DF_melted
names(DF_melted) <- c("FILL", "X", "value")
# Add code to make this a faceted plot
ggplot(DF_melted, aes(x = X, y = value, fill = FILL)) +
  geom_col(position = "stack") +
  BMI_fill + 
  facet_grid(FILL ~ .) # Facets

Mosaic plots visualize contingency tables; coloring the tiles by chi-squared residuals turns them into a visualization of a chi-squared test.

# The initial contingency table
DF <- as.data.frame.matrix(table(adult$SRAGE_P, adult$RBMI))

# Create groupSum, xmax and xmin columns
DF$groupSum <- rowSums(DF)
DF$xmax <- cumsum(DF$groupSum)
DF$xmin <- DF$xmax - DF$groupSum
# The groupSum column needs to be removed; don't remove this line
DF$groupSum <- NULL

# Copy row names to variable X
DF$X <- row.names(DF)

# Melt the dataset
library(reshape2)
DF_melted <- melt(DF, id.vars = c("X", "xmin", "xmax"), variable.name = "FILL")

# dplyr call to calculate ymin and ymax - don't change
library(dplyr)
DF_melted <- DF_melted %>%
  group_by(X) %>%
  mutate(ymax = cumsum(value/sum(value)),
         ymin = ymax - value/sum(value))

# Plot rectangles - don't change
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.4.4
ggplot(DF_melted, aes(ymin = ymin,
                 ymax = ymax,
                 xmin = xmin,
                 xmax = xmax,
                 fill = FILL)) +
  geom_rect(colour = "white") +
  scale_x_continuous(expand = c(0,0)) +
  scale_y_continuous(expand = c(0,0)) +
  BMI_fill +
  theme_tufte()

# Perform chi.sq test (RBMI and SRAGE_P)
results <- chisq.test(table(adult$RBMI, adult$SRAGE_P))

# Melt results$residuals and store as resid
resid <- melt(results$residuals)

# Change names of resid
names(resid) <- c("FILL", "X", "residual")

# merge the two datasets:
DF_all <- merge(DF_melted, resid)

# Update plot command
library(ggthemes)
ggplot(DF_all, aes(ymin = ymin,
                   ymax = ymax,
                   xmin = xmin,
                   xmax = xmax,
                   fill = residual)) +
  geom_rect() +
  scale_fill_gradient2() +
  scale_x_continuous(expand = c(0,0)) +
  scale_y_continuous(expand = c(0,0)) +
  theme_tufte()

Scatterplots

Available aesthetics include x = continuous, y = continuous, color = <factor>, and size = continuous. Use facet_wrap(~ <factor>) to create sub-plots. Use expand_limits(y = 0) to ensure the y axis includes zero.

gapminder_gt1952 <- gapminder %>%
  filter(year >= 1952)
ggplot(gapminder_gt1952, aes(x = gdpPercap, y = lifeExp, 
                           color = continent, size = gdpPercap)) + 
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ year)

Box Plots

Boxplots are built with geom_boxplot().

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Add a title to this graph: "Comparing GDP per capita across continents"
ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(title = "Comparing GDP per capita across continents")

3.3 Grouping and Summarizing

Use the summarize verb to summarize grouped variables. Available summary functions include mean, sum, median, min, and max. Save the summarized data to an object for use in visualization.

by_country_year <- gapminder %>%
  filter(continent == "Asia") %>%
  group_by(country, year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

ggplot(by_country_year, aes(x = year, y = medianLifeExp, 
                            color = country, size = maxGdpPercap)) +
  geom_point()

Exploring Data

Get the data frame dimensions (row and column counts) with dim(). Use the which.min() and which.max() functions to find the record with the smallest or largest value of the requested variable.

cars <- read.table("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars_semicolon.csv", sep = ";", header = TRUE)
head(cars)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
cars[which.min(cars$mpg), ]
##     mpg cyl disp  hp drat   wt  qsec vs am gear carb
## 15 10.4   8  472 205 2.93 5.25 17.98  0  0    3    4

See the levels of a factor variable with the levels() function.

levels(as.factor(mtcars$am))
## [1] "0" "1"

Recode a variable by placing a condition in the row argument.

#Assign the value of mtcars to the new variable mtcars2
mtcars2 <- mtcars

#Assign the label "high" to mpgcategory where mpg is greater than or equal to 20
mtcars2$mpgcategory[mtcars2$mpg >= 20] <- "high"

#Assign the label "low" to mpgcategory where mpg is less than 20
mtcars2$mpgcategory[mtcars2$mpg < 20] <- "low"

#Assign mpgcategory as factor to mpgfactor
mtcars2$mpgfactor <- as.factor(mtcars2$mpgcategory)

Create a frequency table with table()

table(mtcars$am)
## 
##  0  1 
## 19 13

Barplot: in base R barplot(), the bar heights come from the height argument and the bar labels from names.arg.

data <- data.frame(outcome = 0:5, probs = c(0.1, 0.2, 0.3, 0.2, 0.1, 0.1))

# make a bar plot of the probability distribution
barplot(height = data$probs, names.arg = data$outcome)

Create a histogram with hist(), and a boxplot with boxplot().

# Make a histogram of the carb variable from the mtcars data set. Set the title to "Carburetors"
# arguments to change the y-axis scale to 0 - 20, label the x-axis and colour the bars red
hist(mtcars$carb, main = "Carburetors", ylim = c(0,20), col = "red", xlab = "Number of Carburetors")

# Make a boxplot of qsec
boxplot(mtcars$qsec)

There is no mode function in R! Use a sorted frequency table instead.

# Produce a sorted frequency table of `carb` from `mtcars`
sort(table(mtcars$carb), decreasing = TRUE)
## 
##  2  4  1  3  6  8 
## 10 10  7  3  1  1

Similarly, compute the range from the min() and max() functions, and the interquartile range with quantile() or IQR().

# Minimum value
x <- min(mtcars$mpg)
# Maximum value
y <- max(mtcars$mpg)
# Calculate the range of mpg using x and y
y - x
## [1] 23.5
quantile(mtcars$qsec)
##      0%     25%     50%     75%    100% 
## 14.5000 16.8925 17.7100 18.9000 22.9000
# Calculate the interquartile range of qsec
IQR(mtcars$qsec)
## [1] 2.0075

Calculate the standard deviation with sd().

sd(mtcars$mpg)
## [1] 6.026948

Ordinary Least Squares

Checking Assumptions: Recall the four assumptions of OLS: 1) linear relationships between the response variable and each predictor variable, 2) independent predictor variables, 3) normally distributed residuals, and 4) equal residual variances.

Normality: The residuals should be normally distributed. If they are not, the OLS estimators yield confidence intervals that are too wide or too narrow. Test with a normal probability plot, qqnorm(cog_final$residuals), plus a reference line from qqline(cog_final$residuals) (a bow-shaped pattern indicates non-normality); a histogram, hist(cog_final$residuals); or a residuals plot (look for random scatter around 0). Note: the normality check sometimes fails when the linearity assumption does not hold.
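
cog_final comes from a course exercise and is not defined here; as a self-contained sketch, the same checks on a model fit to mtcars look like this:

fit <- lm(mpg ~ wt + hp, data = mtcars)
qqnorm(resid(fit))               # normal probability plot of the residuals
qqline(resid(fit))               # reference line; bowing suggests non-normality
hist(resid(fit))                 # histogram of the residuals
plot(fitted(fit), resid(fit))    # residuals vs fitted: look for random scatter around 0
abline(h = 0, lty = 2)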

Create a scatterplot of quantitative data with plot(). Create a contingency table of categorical data with table(). Calculate Pearson's r with cor(var1, var2).

plot(women$weight, women$height, main = "Heights and Weights")

#table(smoking$tobacco,smoking$student)
money <- c(4, 3, 2, 2, 8, 1, 1, 2, 3, 4, 5, 6, 7, 9, 9, 8, 12)
education <- c(3, 4, 6, 9, 3, 3, 1, 2, 1, 4, 5, 7, 10, 8, 7, 6, 9)

# calculate the correlation between X and Y
cor(education,money)
## [1] 0.5846627
# save regression coefficients as object "line"
line<-lm(money~education)

# print the regression coefficients
line
## 
## Call:
## lm(formula = money ~ education)
## 
## Coefficients:
## (Intercept)    education  
##      1.5744       0.6731
# plot Y and X
plot(education,money, main="My Scatterplot")

# add the regression line
abline(line)

We can use abline() to add any line we like, as long as the first argument is the intercept and the second is the slope.
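
For instance, reusing the education and money plot above (the intercept, slope, and line type are arbitrary):

abline(a = 0, b = 1, lty = 3)   # adds the line y = 0 + 1 * x to the open plot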

dnorm returns the normal density at the value x for the given mean and sd. pnorm returns the cumulative probability at the specified value (quantile) q. qnorm returns the value (quantile) q at the specified cumulative probability (percentile) p.

# probability of a woman having a hair length of less than 20 centimeters
round(pnorm(20, mean = 25, sd = 5), digits = 2)
## [1] 0.16
round(pnorm((20-25)/5), digits = 2)
## [1] 0.16
# 85th percentile of female hair length
qnorm(.85, mean = 25, sd = 5)
## [1] 30.18217

dbinom returns the binomial probability of X=x successes given size trials and probability of success prob. pbinom returns the cumulative probability (percentile) p at the specified value (quantile) q. qbinom returns the value (quantile) q at the specified cumulative probability (percentile) p.

# probability of answering 5 of 25 questions correctly when p = .2.
dbinom(x = 5, size = 25, prob = .2)
## [1] 0.1960151
# probability of answering >=5 of 25 questions correctly when p = .2.
pbinom(q = 4, size = 25, prob = .2, lower.tail = FALSE)
## [1] 0.5793257
# calculate the 60th percentile
qbinom(p = .6, size = 25, prob = .2)
## [1] 5

Sample data from a set with the sample() function.
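
A quick sketch (the seed is arbitrary):

set.seed(123)                             # make the draws reproducible
sample(1:6, size = 2)                     # two values, without replacement
sample(1:6, size = 10, replace = TRUE)    # ten values, with replacement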

For loop.

# initialize an empty vector
new_number <- NULL
for (i in 1:10) {
  new_number[i] <- i
}
print(new_number)
##  [1]  1  2  3  4  5  6  7  8  9 10

Case Studies in Data Cleaning

Example 1

url_sales <- 'http://s3.amazonaws.com/assets.datacamp.com/production/course_1294/datasets/sales.csv'
sales <- read.csv(url_sales)

# Inspect data.
dim(sales)
## [1] 5000   46
head(sales)
##   X             event_id       primary_act_id     secondary_act_id
## 1 1 abcaf1adb99a935fc661 43f0436b905bfa7c2eec b85143bf51323b72e53c
## 2 2 6c56d7f08c95f2aa453c 1a3e9aecd0617706a794 f53529c5679ea6ca5a48
## 3 3 c7ab4524a121f9d687d2 4b677c3f5bec71eec8d1 b85143bf51323b72e53c
## 4 4 394cb493f893be9b9ed1 b1ccea01ad6ef8522796 b85143bf51323b72e53c
## 5 5 55b5f67e618557929f48 91c03a34b562436efa3c b85143bf51323b72e53c
## 6 6 4f10fd8b9f550352bd56 ac4b847b3fde66f2117e 63814f3d63317f1b56c4
##    purch_party_lkup_id
## 1 7dfa56dd7d5956b17587
## 2 4f9e6fc637eaf7b736c2
## 3 6c2545703bd527a7144d
## 4 527d6b1eaffc69ddd882
## 5 8bd62c394a35213bdf52
## 6 3b3a628f83135acd0676
##                                                       event_name
## 1 Xfinity Center Mansfield Premier Parking: Florida Georgia Line
## 2                  Gorge Camping - dave matthews band - sept 3-7
## 3                    Dodge Theatre Adams Street Parking - benise
## 4   Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow
## 5                                  Premier Parking - motley crue
## 6                                      Fast Lane Access: Journey
##                           primary_act_name secondary_act_name
## 1 XFINITY Center Mansfield Premier Parking               NULL
## 2                            Gorge Camping Dave Matthews Band
## 3                            Parking Event               NULL
## 4         Gexa Energy Pavilion VIP Parking               NULL
## 5 White River Amphitheatre Premier Parking               NULL
## 6                         Fast Lane Access            Journey
##   major_cat_name         minor_cat_name la_event_type_cat
## 1           MISC                PARKING           PARKING
## 2           MISC                CAMPING           INVALID
## 3           MISC                PARKING           PARKING
## 4           MISC                PARKING           PARKING
## 5           MISC                PARKING           PARKING
## 6           MISC SPECIAL ENTRY (UPSELL)            UPSELL
##                                                  event_disp_name
## 1 Xfinity Center Mansfield Premier Parking: Florida Georgia Line
## 2                  Gorge Camping - dave matthews band - sept 3-7
## 3                    Dodge Theatre Adams Street Parking - benise
## 4   Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow
## 5                                  Premier Parking - motley crue
## 6                                      Fast Lane Access: Journey
##                                                                                                                                                    ticket_text
## 1    THIS TICKET IS VALID        FOR PARKING ONLY         GOOD THIS DAY ONLY       PREMIER PARKING PASS    XFINITY CENTER,LOTS 4 PM  SAT SEP 12 2015 7:30 PM  
## 2                                                                %OVERNIGHT C A M P I N G%* * * * * *%GORGE CAMPGROUND%* GOOD THIS DATE ONLY *%SEP 3 - 6, 2009
## 3                               ADAMS STREET GARAGE%PARKING FOR 4/21/06 ONLY%DODGE THEATRE PARKING PASS%ENTRANCE ON ADAMS STREET%BENISE%GARAGE OPENS AT 6:00PM
## 4    THIS TICKET IS VALID        FOR PARKING ONLY      GOOD FOR THIS DATE ONLY       VIP PARKING PASS        GEXA ENERGY PAVILION    FRI SEP 02 2011 7:00 PM  
## 5                              THIS TICKET IS VALID%FOR PARKING ONLY%GOOD THIS DATE ONLY%PREMIER PARKING PASS%WHITE RIVER AMPHITHEATRE%SAT JUL 30, 2005 6:00PM
## 6         FAST LANE                  JOURNEY               FAST LANE EVENT         THIS IS NOT A TICKET    SAN MANUEL AMPHITHEATER   SAT JUL 21 2012 7:00 PM  
##   tickets_purchased_qty trans_face_val_amt delivery_type_cd
## 1                     1                 45          eTicket
## 2                     1                 75       TicketFast
## 3                     1                  5       TicketFast
## 4                     1                 20             Mail
## 5                     1                 20             Mail
## 6                     2                 10       TicketFast
##       event_date_time   event_dt presale_dt  onsale_dt
## 1 2015-09-12 23:30:00 2015-09-12       NULL 2015-05-15
## 2 2009-09-05 01:00:00 2009-09-04       NULL 2009-03-13
## 3 2006-04-22 01:30:00 2006-04-21       NULL 2006-02-25
## 4 2011-09-03 00:00:00 2011-09-02       NULL 2011-04-22
## 5 2005-07-31 01:00:00 2005-07-30 2005-03-02 2005-03-04
## 6 2012-07-22 02:00:00 2012-07-21       NULL 2012-04-11
##   sales_ord_create_dttm sales_ord_tran_dt   print_dt timezn_nm
## 1   2015-09-11 18:17:45        2015-09-11 2015-09-12       EST
## 2   2009-07-06 00:00:00        2009-07-05 2009-09-01       PST
## 3   2006-04-05 00:00:00        2006-04-05 2006-04-05       MST
## 4   2011-07-01 17:38:50        2011-07-01 2011-07-06       CST
## 5   2005-06-18 00:00:00        2005-06-18 2005-06-28       PST
## 6   2012-07-21 17:20:18        2012-07-21 2012-07-21       PST
##       venue_city   venue_state venue_postal_cd_sgmt_1
## 1      MANSFIELD MASSACHUSETTS                  02048
## 2         QUINCY    WASHINGTON                  98848
## 3        PHOENIX       ARIZONA                  85003
## 4         DALLAS         TEXAS                  75210
## 5         AUBURN    WASHINGTON                  98092
## 6 SAN BERNARDINO    CALIFORNIA                  92407
##             sales_platform_cd print_flg la_valid_tkt_event_flg  fin_mkt_nm
## 1 www.concerts.livenation.com        T                      N       Boston
## 2                        NULL        T                      N      Seattle
## 3                        NULL        T                      N      Arizona
## 4                        NULL        T                      N       Dallas
## 5                        NULL        T                      N      Seattle
## 6          www.livenation.com        T                      N  Los Angeles
##   web_session_cookie_val gndr_cd age_yr income_amt edu_val
## 1   7dfa56dd7d5956b17587    <NA>   <NA>       <NA>    <NA>
## 2   4f9e6fc637eaf7b736c2    <NA>   <NA>       <NA>    <NA>
## 3   6c2545703bd527a7144d    <NA>   <NA>       <NA>    <NA>
## 4   527d6b1eaffc69ddd882    <NA>   <NA>       <NA>    <NA>
## 5   8bd62c394a35213bdf52    <NA>   <NA>       <NA>    <NA>
## 6   3b3a628f83135acd0676    <NA>   <NA>       <NA>    <NA>
##   edu_1st_indv_val edu_2nd_indv_val adults_in_hh_num married_ind
## 1             <NA>             <NA>             <NA>        <NA>
## 2             <NA>             <NA>             <NA>        <NA>
## 3             <NA>             <NA>             <NA>        <NA>
## 4             <NA>             <NA>             <NA>        <NA>
## 5             <NA>             <NA>             <NA>        <NA>
## 6             <NA>             <NA>             <NA>        <NA>
##   child_present_ind home_owner_ind occpn_val occpn_1st_val occpn_2nd_val
## 1              <NA>           <NA>      <NA>          <NA>          <NA>
## 2              <NA>           <NA>      <NA>          <NA>          <NA>
## 3              <NA>           <NA>      <NA>          <NA>          <NA>
## 4              <NA>           <NA>      <NA>          <NA>          <NA>
## 5              <NA>           <NA>      <NA>          <NA>          <NA>
## 6              <NA>           <NA>      <NA>          <NA>          <NA>
##   dist_to_ven
## 1          NA
## 2          59
## 3          NA
## 4          NA
## 5          NA
## 6          NA
names(sales)
##  [1] "X"                      "event_id"              
##  [3] "primary_act_id"         "secondary_act_id"      
##  [5] "purch_party_lkup_id"    "event_name"            
##  [7] "primary_act_name"       "secondary_act_name"    
##  [9] "major_cat_name"         "minor_cat_name"        
## [11] "la_event_type_cat"      "event_disp_name"       
## [13] "ticket_text"            "tickets_purchased_qty" 
## [15] "trans_face_val_amt"     "delivery_type_cd"      
## [17] "event_date_time"        "event_dt"              
## [19] "presale_dt"             "onsale_dt"             
## [21] "sales_ord_create_dttm"  "sales_ord_tran_dt"     
## [23] "print_dt"               "timezn_nm"             
## [25] "venue_city"             "venue_state"           
## [27] "venue_postal_cd_sgmt_1" "sales_platform_cd"     
## [29] "print_flg"              "la_valid_tkt_event_flg"
## [31] "fin_mkt_nm"             "web_session_cookie_val"
## [33] "gndr_cd"                "age_yr"                
## [35] "income_amt"             "edu_val"               
## [37] "edu_1st_indv_val"       "edu_2nd_indv_val"      
## [39] "adults_in_hh_num"       "married_ind"           
## [41] "child_present_ind"      "home_owner_ind"        
## [43] "occpn_val"              "occpn_1st_val"         
## [45] "occpn_2nd_val"          "dist_to_ven"
# Conclusion:  rows are individual purchases, columns are information about each purchase - good! 

# Get a feel for the data.
str(sales)
## 'data.frame':    5000 obs. of  46 variables:
##  $ X                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ event_id              : Factor w/ 3746 levels "00071bfcbb27802045b2",..: 2477 1559 2894 846 1216 1119 244 1276 229 2114 ...
##  $ primary_act_id        : Factor w/ 709 levels "00166bacddabff148a03",..: 190 85 214 495 405 482 452 405 59 677 ...
##  $ secondary_act_id      : Factor w/ 535 levels "00a4512e22fe9d3a1350",..: 387 509 387 387 387 185 387 387 513 454 ...
##  $ purch_party_lkup_id   : Factor w/ 4978 levels "000f44312eae9b7e5cae",..: 2476 1565 2110 1620 2742 1185 422 3888 4591 2974 ...
##  $ event_name            : Factor w/ 2512 levels "\"\"\"\"\"\"\"\"weird Al\"\"\"\"\"\"\"\" Yankovic - the mandatory world tour",..: 2494 845 451 763 1412 575 1369 1914 175 1869 ...
##  $ primary_act_name      : Factor w/ 710 levels "3 Doors Down",..: 700 249 462 238 690 205 478 690 268 528 ...
##  $ secondary_act_name    : Factor w/ 537 levels ".38 Special",..: 329 107 329 329 329 227 329 329 54 368 ...
##  $ major_cat_name        : Factor w/ 5 levels "ARTS","CONCERTS",..: 4 4 4 4 4 4 4 4 4 2 ...
##  $ minor_cat_name        : Factor w/ 44 levels "ADULT CONTEMPORARY",..: 30 7 30 30 30 39 30 30 30 16 ...
##  $ la_event_type_cat     : Factor w/ 7 levels "ARTS","CONCERTS",..: 5 4 5 5 5 7 5 5 5 2 ...
##  $ event_disp_name       : Factor w/ 2511 levels "\"\"\"\"\"\"\"\"weird Al\"\"\"\"\"\"\"\" Yankovic - the mandatory world tour",..: 2493 844 450 762 1411 574 1368 1913 174 1868 ...
##  $ ticket_text           : Factor w/ 3746 levels "                                     STYX           W/ THE NASHVILLE SYMPHONY    ASCEND AMPHITHEATER          R"| __truncated__,..: 1375 2019 2292 1452 3147 145 3385 1994 1629 179 ...
##  $ tickets_purchased_qty : int  1 1 1 1 1 2 1 1 1 1 ...
##  $ trans_face_val_amt    : num  45 75 5 20 20 10 30 28 20 25 ...
##  $ delivery_type_cd      : Factor w/ 7 levels "BWC","eTicket",..: 2 6 6 4 4 6 6 4 6 2 ...
##  $ event_date_time       : Factor w/ 3178 levels "2005-02-21 00:30:00",..: 2351 1062 171 1353 78 1487 238 1765 1575 1945 ...
##  $ event_dt              : Factor w/ 1635 levels "2005-02-20","2005-03-19",..: 1301 606 125 788 58 887 161 1056 940 1176 ...
##  $ presale_dt            : Factor w/ 421 levels "2005-01-28","2005-02-23",..: 421 421 421 421 4 421 421 421 421 421 ...
##  $ onsale_dt             : Factor w/ 1040 levels "2004-12-11","2005-01-14",..: 906 382 71 522 5 608 110 695 652 837 ...
##  $ sales_ord_create_dttm : Factor w/ 3860 levels "2005-01-14 00:00:00",..: 2731 781 157 1039 63 1309 222 1436 1334 2104 ...
##  $ sales_ord_tran_dt     : Factor w/ 1849 levels "2005-01-14","2005-01-31",..: 1683 803 158 1046 63 1221 223 1314 1238 1556 ...
##  $ print_dt              : Factor w/ 1867 levels "1900-01-01","2005-01-16",..: 1692 855 185 1070 82 1232 250 1327 1250 1562 ...
##  $ timezn_nm             : Factor w/ 4 levels "CST","EST","MST",..: 2 4 3 1 4 4 2 4 3 4 ...
##  $ venue_city            : Factor w/ 199 levels "ABBOTSFORD","AKRON",..: 99 140 132 48 13 155 77 13 132 157 ...
##  $ venue_state           : Factor w/ 48 levels "ALABAMA","ALBERTA",..: 21 46 3 44 46 6 28 46 3 6 ...
##  $ venue_postal_cd_sgmt_1: Factor w/ 312 levels "01608","02035",..: 3 276 219 198 270 244 23 270 221 256 ...
##  $ sales_platform_cd     : Factor w/ 15 levels "","android.ticketmaster.us",..: 11 10 10 10 10 12 10 11 12 7 ...
##  $ print_flg             : Factor w/ 2 levels "F ","T ": 2 2 2 2 2 2 2 2 2 2 ...
##  $ la_valid_tkt_event_flg: Factor w/ 2 levels "N ","Y ": 1 1 1 1 1 1 1 1 1 2 ...
##  $ fin_mkt_nm            : Factor w/ 51 levels "Arizona","Atlanta",..: 4 43 1 13 43 23 33 43 1 29 ...
##  $ web_session_cookie_val: Factor w/ 4978 levels "000f44312eae9b7e5cae",..: 2476 1565 2110 1620 2742 1185 422 3888 4591 2974 ...
##  $ gndr_cd               : Factor w/ 3 levels "F","M","NULL": NA NA NA NA NA NA 2 NA NA NA ...
##  $ age_yr                : Factor w/ 35 levels "18","20","22",..: NA NA NA NA NA NA 6 NA NA NA ...
##  $ income_amt            : Factor w/ 10 levels "10000","112500",..: NA NA NA NA NA NA 2 NA NA NA ...
##  $ edu_val               : Factor w/ 4 levels "College","Graduate School",..: NA NA NA NA NA NA 3 NA NA NA ...
##  $ edu_1st_indv_val      : Factor w/ 4 levels "College","Graduate School",..: NA NA NA NA NA NA 3 NA NA NA ...
##  $ edu_2nd_indv_val      : Factor w/ 4 levels "College","Graduate School",..: NA NA NA NA NA NA 4 NA NA NA ...
##  $ adults_in_hh_num      : Factor w/ 7 levels "1","2","3","4",..: NA NA NA NA NA NA 4 NA NA NA ...
##  $ married_ind           : Factor w/ 3 levels "0","1","NULL": NA NA NA NA NA NA 1 NA NA NA ...
##  $ child_present_ind     : Factor w/ 3 levels "0","1","NULL": NA NA NA NA NA NA 2 NA NA NA ...
##  $ home_owner_ind        : Factor w/ 3 levels "0","1","NULL": NA NA NA NA NA NA 1 NA NA NA ...
##  $ occpn_val             : Factor w/ 11 levels "Admin Managerial",..: NA NA NA NA NA NA 5 NA NA NA ...
##  $ occpn_1st_val         : Factor w/ 11 levels "Admin Managerial",..: NA NA NA NA NA NA 3 NA NA NA ...
##  $ occpn_2nd_val         : Factor w/ 10 levels "Admin Managerial",..: NA NA NA NA NA NA 5 NA NA NA ...
##  $ dist_to_ven           : int  NA 59 NA NA NA NA NA NA NA NA ...
summary(sales)
##        X                        event_id                 primary_act_id
##  Min.   :   1   84a260b1bcd31e2e75a7:  13   4b677c3f5bec71eec8d1: 208  
##  1st Qu.:1251   6c56d7f08c95f2aa453c:  10   1a3e9aecd0617706a794: 167  
##  Median :2500   6ce493f24421534b4040:   9   6cdc2e270775b7e2f709: 148  
##  Mean   :2500   24d74ef53592d1e950fc:   8   ac4b847b3fde66f2117e: 143  
##  3rd Qu.:3750   b62b844fd17979d24df6:   8   43f0436b905bfa7c2eec: 116  
##  Max.   :5000   b67715ea1653ae26356f:   8   3f510718b680022e6c39: 111  
##                 (Other)             :4944   (Other)             :4107  
##              secondary_act_id           purch_party_lkup_id
##  b85143bf51323b72e53c:3414    4834e7c166768041a7c3:   3    
##  e2981973281c70939168:  51    08cb715b804edce092c1:   2    
##  f53529c5679ea6ca5a48:  47    1d407fe16b5ea4b880f2:   2    
##  9021d10ae169fed0ebb8:  30    23cd7da8896a31c87453:   2    
##  8d74e7609bc261c55a13:  26    27ec6221921b66698dc7:   2    
##  7205f93a45b2e20210bf:  25    29ebf9ce8bad4d323f67:   2    
##  (Other)             :1407    (Other)             :4987    
##                                   event_name  
##  Beyonce - the formation world tour    :  85  
##  Dave Matthews Band                    :  42  
##  Coldplay- A Head Full Of Dreams Tour  :  29  
##  HOUSE OF BLUES PASS THE LINE          :  27  
##  Premier Parking: Dave Matthews Band   :  27  
##  Luke Bryan: Kick The Dust Up Tour 2015:  26  
##  (Other)                               :4764  
##                                       primary_act_name
##  Parking Event                                : 208   
##  Gorge Camping                                : 167   
##  Vip Fast Lane                                : 148   
##  Fast Lane Access                             : 143   
##  XFINITY Center Mansfield Premier Parking     : 116   
##  Verizon Wireless Amph. Irvine Premier Parking: 111   
##  (Other)                                      :4107   
##            secondary_act_name  major_cat_name
##  NULL               :3414     ARTS    :  25  
##  Sasquatch! Festival:  51     CONCERTS:1998  
##  Dave Matthews Band :  47     FAMILY  :   4  
##  Randy Houser       :  30     MISC    :2972  
##  Panic! At The Disco:  26     SPORTS  :   1  
##  Hunter Hayes       :  25                    
##  (Other)            :1407                    
##                 minor_cat_name la_event_type_cat
##  PARKING               :2314   ARTS    : 104    
##  ROCK/POP              : 721   CONCERTS:1906    
##  ALTERNATIVE ROCK      : 402   FAMILY  :   4    
##  SPECIAL ENTRY (UPSELL): 311   INVALID : 171    
##  COUNTRY               : 238   PARKING :2324    
##  CAMPING               : 158   SPORTS  :   1    
##  (Other)               : 856   UPSELL  : 490    
##                                event_disp_name
##  Beyonce - the formation world tour    :  85  
##  Dave Matthews Band                    :  42  
##  Coldplay- A Head Full Of Dreams Tour  :  29  
##  HOUSE OF BLUES PASS THE LINE          :  27  
##  Premier Parking: Dave Matthews Band   :  27  
##  Luke Bryan: Kick The Dust Up Tour 2015:  26  
##  (Other)                               :4764  
##                                                                                                                                                        ticket_text  
##  %OVERNIGHT C A M P I N G%SASQUATCH!%GORGE CAMPGROUND%GOOD THESE DAYS ONLY%MAY 22 - 25, 2009                                                                 :  13  
##  %OVERNIGHT C A M P I N G%* * * * * *%GORGE CAMPGROUND%* GOOD THIS DATE ONLY *%SEP 3 - 6, 2009                                                               :  10  
##     LIVE NATION PRESENTS            COLDPLAY         A HEAD FULL OF DREAMS TOUR       AT&T STADIUM           ALL TAXES INCLUDED     SAT AUG 27 2016 8:00 PM  :   9  
##     LIVE NATION PRESENTS            BEYONCE           THE FORMATION WORLD TOUR         CITI FIELD              RAIN OR SHINE         TUE JUN 07 2016 6:00PM  :   8  
##     Live Nation Presents            BEYONCE           The Formation World Tour     CENTURYLINK FIELD           RAIN OR SHINE         WED MAY 18 2016 6:00PM  :   8  
##     LIVE NATION PRESENTS            BEYONCE           THE FORMATION WORLD TOUR  ROSE BOWL, PASADENA, CA        RAIN OR SHINE          SAT MAY 14 2016 6PM    :   8  
##  (Other)                                                                                                                                                     :4944  
##  tickets_purchased_qty trans_face_val_amt   delivery_type_cd
##  Min.   :1.000         Min.   :   1.00    BWC       :  75   
##  1st Qu.:1.000         1st Qu.:  20.00    eTicket   :1301   
##  Median :1.000         Median :  30.00    ISPU      :  53   
##  Mean   :1.639         Mean   :  77.08    Mail      :1504   
##  3rd Qu.:2.000         3rd Qu.:  85.00    Paperless :  13   
##  Max.   :8.000         Max.   :1520.88    TicketFast:1893   
##                                           UPS       : 161   
##             event_date_time       event_dt         presale_dt  
##  2009-05-23 19:00:00:  17   2008-08-22:  18   NULL      :2892  
##  2009-09-05 01:00:00:  12   2008-05-24:  17   2016-02-09:  71  
##  2016-08-07 00:00:00:  10   2009-05-23:  17   2016-02-03:  40  
##  2016-08-28 01:00:00:   9   2015-07-18:  17   2016-01-19:  39  
##  2008-05-24 19:00:00:   8   2016-05-14:  17   2016-02-15:  33  
##  2008-07-17 02:00:00:   8   2008-08-02:  16   2016-02-16:  32  
##  (Other)            :4936   (Other)   :4898   (Other)   :1893  
##       onsale_dt            sales_ord_create_dttm  sales_ord_tran_dt
##  NULL      : 101   2006-04-08 00:00:00:  19      2016-01-29:  51   
##  2016-02-05:  82   2006-05-06 00:00:00:  19      2016-02-09:  49   
##  2016-01-22:  61   2006-05-05 00:00:00:  15      2016-02-19:  45   
##  2015-04-24:  55   2008-03-10 00:00:00:  14      2016-02-12:  31   
##  2016-02-16:  55   2005-04-02 00:00:00:  12      2016-02-15:  30   
##  2016-02-19:  54   2007-04-21 00:00:00:  12      2016-02-05:  29   
##  (Other)   :4592   (Other)            :4909      (Other)   :4765   
##        print_dt    timezn_nm         venue_city           venue_state  
##  NULL      : 424   CST:1175   PHOENIX     : 213   CALIFORNIA    : 712  
##  1900-01-01:  35   EST:2353   ATLANTA     : 210   NEW YORK      : 381  
##  2016-02-05:  23   MST: 285   CHARLOTTE   : 171   WASHINGTON    : 347  
##  2016-02-12:  20   PST:1187   IRVINE      : 152   INDIANA       : 296  
##  2016-02-23:  20              MANSFIELD   : 150   NORTH CAROLINA: 293  
##  2016-02-26:  20              INDIANAPOLIS: 149   TEXAS         : 270  
##  (Other)   :4458              (Other)     :3955   (Other)       :2701  
##  venue_postal_cd_sgmt_1                   sales_platform_cd print_flg
##  98848  : 245           NULL                       :2421    F : 459  
##  92618  : 152           www.concerts.livenation.com:1097    T :4541  
##  02048  : 150           www.ticketmaster.com       : 688             
##  46204  : 148           mobile.livenation.us       : 198             
##  46060  : 147           iphone.ticketmaster.us     : 173             
##  30303  : 144           mobile.ticketmaster.us     : 122             
##  (Other):4014           (Other)                    : 301             
##  la_valid_tkt_event_flg         fin_mkt_nm  
##  N :2985                New York     : 462  
##  Y :2015                Boston       : 381  
##                         Seattle      : 339  
##                         Los Angeles  : 332  
##                         Indiana-Ohio : 295  
##                         N. California: 283  
##                         (Other)      :2908  
##           web_session_cookie_val gndr_cd         age_yr       income_amt  
##  4834e7c166768041a7c3:   3       F   :  92   NULL   :  47   NULL   :  61  
##  08cb715b804edce092c1:   2       M   :  85   24     :  13   62500  :  41  
##  1d407fe16b5ea4b880f2:   2       NULL:  38   34     :  10   200000 :  28  
##  23cd7da8896a31c87453:   2       NA's:4785   30     :   9   87500  :  28  
##  27ec6221921b66698dc7:   2                   44     :   9   45000  :  19  
##  29ebf9ce8bad4d323f67:   2                   (Other): 127   (Other):  38  
##  (Other)             :4987                   NA's   :4785   NA's   :4785  
##             edu_val            edu_1st_indv_val        edu_2nd_indv_val
##  College        :  41   College        :  36    College        :  27   
##  Graduate School:  25   Graduate School:  18    Graduate School:  12   
##  High School    :  73   High School    :  58    High School    :  47   
##  NULL           :  76   NULL           : 103    NULL           : 129   
##  NA's           :4785   NA's           :4785    NA's           :4785   
##                                                                        
##                                                                        
##  adults_in_hh_num married_ind child_present_ind home_owner_ind
##  1      :  52     0   :  52   0   :  57         0   :   8     
##  2      :  48     1   : 105   1   :  73         1   : 130     
##  NULL   :  39     NULL:  58   NULL:  85         NULL:  77     
##  3      :  30     NA's:4785   NA's:4785         NA's:4785     
##  4      :  25                                                 
##  (Other):  21                                                 
##  NA's   :4785                                                 
##                   occpn_val                   occpn_1st_val 
##  NULL                  : 136   NULL                  : 138  
##  Professional Technical:  27   Professional Technical:  29  
##  Clerical White Collar :  14   Admin Managerial      :  11  
##  Craftsman Blue Collar :  13   Clerical White Collar :  10  
##  Homemaker             :   8   Craftsman Blue Collar :  10  
##  (Other)               :  17   (Other)               :  17  
##  NA's                  :4785   NA's                  :4785  
##                 occpn_2nd_val   dist_to_ven    
##  NULL                  : 152   Min.   :   0.0  
##  Professional Technical:  21   1st Qu.:  12.0  
##  Clerical WhiteCollar  :  13   Median :  26.0  
##  Homemaker             :  12   Mean   : 158.2  
##  Craftsman BlueCollar  :   7   3rd Qu.:  77.5  
##  (Other)               :  10   Max.   :2548.0  
##  NA's                  :4785   NA's   :4677
library(dplyr)
glimpse(sales)
## Observations: 5,000
## Variables: 46
## $ X                      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
## $ event_id               <fct> abcaf1adb99a935fc661, 6c56d7f08c95f2aa4...
## $ primary_act_id         <fct> 43f0436b905bfa7c2eec, 1a3e9aecd0617706a...
## $ secondary_act_id       <fct> b85143bf51323b72e53c, f53529c5679ea6ca5...
## $ purch_party_lkup_id    <fct> 7dfa56dd7d5956b17587, 4f9e6fc637eaf7b73...
## $ event_name             <fct> Xfinity Center Mansfield Premier Parkin...
## $ primary_act_name       <fct> XFINITY Center Mansfield Premier Parkin...
## $ secondary_act_name     <fct> NULL, Dave Matthews Band, NULL, NULL, N...
## $ major_cat_name         <fct> MISC, MISC, MISC, MISC, MISC, MISC, MIS...
## $ minor_cat_name         <fct> PARKING, CAMPING, PARKING, PARKING, PAR...
## $ la_event_type_cat      <fct> PARKING, INVALID, PARKING, PARKING, PAR...
## $ event_disp_name        <fct> Xfinity Center Mansfield Premier Parkin...
## $ ticket_text            <fct>    THIS TICKET IS VALID        FOR PARK...
## $ tickets_purchased_qty  <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 4, ...
## $ trans_face_val_amt     <dbl> 45, 75, 5, 20, 20, 10, 30, 28, 20, 25, ...
## $ delivery_type_cd       <fct> eTicket, TicketFast, TicketFast, Mail, ...
## $ event_date_time        <fct> 2015-09-12 23:30:00, 2009-09-05 01:00:0...
## $ event_dt               <fct> 2015-09-12, 2009-09-04, 2006-04-21, 201...
## $ presale_dt             <fct> NULL, NULL, NULL, NULL, 2005-03-02, NUL...
## $ onsale_dt              <fct> 2015-05-15, 2009-03-13, 2006-02-25, 201...
## $ sales_ord_create_dttm  <fct> 2015-09-11 18:17:45, 2009-07-06 00:00:0...
## $ sales_ord_tran_dt      <fct> 2015-09-11, 2009-07-05, 2006-04-05, 201...
## $ print_dt               <fct> 2015-09-12, 2009-09-01, 2006-04-05, 201...
## $ timezn_nm              <fct> EST, PST, MST, CST, PST, PST, EST, PST,...
## $ venue_city             <fct> MANSFIELD, QUINCY, PHOENIX, DALLAS, AUB...
## $ venue_state            <fct> MASSACHUSETTS, WASHINGTON, ARIZONA, TEX...
## $ venue_postal_cd_sgmt_1 <fct> 02048, 98848, 85003, 75210, 98092, 9240...
## $ sales_platform_cd      <fct> www.concerts.livenation.com, NULL, NULL...
## $ print_flg              <fct> T , T , T , T , T , T , T , T , T , T ,...
## $ la_valid_tkt_event_flg <fct> N , N , N , N , N , N , N , N , N , Y ,...
## $ fin_mkt_nm             <fct> Boston, Seattle, Arizona, Dallas, Seatt...
## $ web_session_cookie_val <fct> 7dfa56dd7d5956b17587, 4f9e6fc637eaf7b73...
## $ gndr_cd                <fct> NA, NA, NA, NA, NA, NA, M, NA, NA, NA, ...
## $ age_yr                 <fct> NA, NA, NA, NA, NA, NA, 28, NA, NA, NA,...
## $ income_amt             <fct> NA, NA, NA, NA, NA, NA, 112500, NA, NA,...
## $ edu_val                <fct> NA, NA, NA, NA, NA, NA, High School, NA...
## $ edu_1st_indv_val       <fct> NA, NA, NA, NA, NA, NA, High School, NA...
## $ edu_2nd_indv_val       <fct> NA, NA, NA, NA, NA, NA, NULL, NA, NA, N...
## $ adults_in_hh_num       <fct> NA, NA, NA, NA, NA, NA, 4, NA, NA, NA, ...
## $ married_ind            <fct> NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, ...
## $ child_present_ind      <fct> NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, ...
## $ home_owner_ind         <fct> NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, ...
## $ occpn_val              <fct> NA, NA, NA, NA, NA, NA, NULL, NA, NA, N...
## $ occpn_1st_val          <fct> NA, NA, NA, NA, NA, NA, Craftsman Blue ...
## $ occpn_2nd_val          <fct> NA, NA, NA, NA, NA, NA, NULL, NA, NA, N...
## $ dist_to_ven            <int> NA, 59, NA, NA, NA, NA, NA, NA, NA, NA,...
# Remove first column (obs no).
sales2 <- sales[,-1]

# Remove first 4 columns (codes) and last 15 columns (too many NAs).
sales3 <- sales2[,c(5:(ncol(sales2) - 15))]
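
As a quick sanity check (an added aside), sales2 has 45 columns after dropping the row-number column, so sales3 should keep 45 - 4 - 15 = 26 of them:

# Confirm the expected column counts.
ncol(sales2)
ncol(sales3)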

# Separate the date times into dates and times.
library(tidyr)
sales4 <- separate(sales3, event_date_time,
                   into = c("event_dt", "event_time"), sep = " ")
sales5 <- separate(sales4, sales_ord_create_dttm, 
                   into = c("ord_create_dt", "ord_create_time"), sep = " ")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 4 rows
## [2516, 3863, 4082, 4183].
# Second command threw warnings.  View problem records.
issues <- c(2516, 3863, 4082, 4183)
sales3$sales_ord_create_dttm[issues]
## [1] NULL NULL NULL NULL
## 3860 Levels: 2005-01-14 00:00:00 2005-01-31 00:00:00 ... NULL
# For comparison, a well-behaved value of sales_ord_create_dttm.
sales3$sales_ord_create_dttm[2517]
## [1] 2013-08-04 23:07:19
## 3860 Levels: 2005-01-14 00:00:00 2005-01-31 00:00:00 ... NULL
# Issue is missing values.  May need to drop records.
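
If those records need to be dropped, one option (a sketch, not part of the original analysis) is to filter on the literal "NULL" level that the output above shows for the problem rows:

# Keep only rows whose order-creation timestamp is not the literal "NULL".
sales3_complete <- sales3[sales3$sales_ord_create_dttm != "NULL", ]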

# Coerce date strings into dates.  Use the fact that date columns have "dt" in their names.
library(stringr)
date_cols <- str_detect(colnames(sales5), "dt")
library(lubridate)
sales5[, date_cols] <- lapply(sales5[,date_cols], ymd)
## Warning: 2892 failed to parse.
## Warning: 101 failed to parse.
## Warning: 4 failed to parse.
## Warning: 424 failed to parse.
# Note the warning messages.  Are they due to NAs?
missing <- lapply(sales5[, date_cols], is.na)
sapply(missing, sum)
##          event_dt        presale_dt         onsale_dt     ord_create_dt 
##                 0              2892               101                 4 
## sales_ord_tran_dt          print_dt 
##                 0               424
# Conclusion: the number of NAs in each column matches the count in the corresponding warning message, so missing data is the culprit.

# Combine the venue_city and venue_state columns
sales6 <- unite(sales5, venue_city_state, venue_city, venue_state, sep = ", ")

Example 2

library(readxl)
# Read Excel data file.  Discard first row (title).
mbta_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1294/datasets/mbta.xlsx"
# read_excel() cannot read an Excel file directly from the internet, so the file was downloaded to a local drive.
# The following command runs, but the downloaded content is unreadable (on Windows, download.file()
# typically needs mode = "wb" for binary files such as .xlsx), so it is commented out.
# download.file(mbta_url, file.path("Programs/Data", "mbta.xlsx"))
mbta_path <- file.path("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Programs/Data", "mbta.xlsx")
mbta <- read_excel(mbta_path, skip = 1)

# Examine organization.
str(mbta)
## Classes 'tbl_df', 'tbl' and 'data.frame':    11 obs. of  60 variables:
##  $ X__1   : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ mode   : chr  "All Modes by Qtr" "Boat" "Bus" "Commuter Rail" ...
##  $ 2007-01: chr  "NA" "4" "335.81900000000002" "142.19999999999999" ...
##  $ 2007-02: chr  "NA" "3.6" "338.67500000000001" "138.5" ...
##  $ 2007-03: num  1188 40 340 138 459 ...
##  $ 2007-04: chr  "NA" "4.3" "352.16199999999998" "139.5" ...
##  $ 2007-05: chr  "NA" "4.9000000000000004" "354.36700000000002" "139" ...
##  $ 2007-06: num  1246 5.8 350.5 143 477 ...
##  $ 2007-07: chr  "NA" "6.5209999999999999" "357.51900000000001" "142.39099999999999" ...
##  $ 2007-08: chr  "NA" "6.5720000000000001" "355.47899999999998" "142.364" ...
##  $ 2007-09: num  1256.57 5.47 372.6 143.05 499.57 ...
##  $ 2007-10: chr  "NA" "5.1449999999999996" "368.84699999999998" "146.542" ...
##  $ 2007-11: chr  "NA" "3.7629999999999999" "330.82600000000002" "145.089" ...
##  $ 2007-12: num  1216.89 2.98 312.92 141.59 448.27 ...
##  $ 2008-01: chr  "NA" "3.1749999999999998" "340.32400000000001" "142.14500000000001" ...
##  $ 2008-02: chr  "NA" "3.1110000000000002" "352.90499999999997" "142.607" ...
##  $ 2008-03: num  1253.52 3.51 361.15 137.45 494.05 ...
##  $ 2008-04: chr  "NA" "4.1639999999999997" "368.18900000000002" "140.38900000000001" ...
##  $ 2008-05: chr  "NA" "4.0149999999999997" "363.90300000000002" "142.58500000000001" ...
##  $ 2008-06: num  1314.82 5.19 362.96 142.06 518.35 ...
##  $ 2008-07: chr  "NA" "6.016" "370.92099999999999" "145.73099999999999" ...
##  $ 2008-08: chr  "NA" "5.8" "361.05700000000002" "144.565" ...
##  $ 2008-09: num  1307.04 4.59 389.54 141.91 517.32 ...
##  $ 2008-10: chr  "NA" "4.2850000000000001" "357.97399999999999" "151.95699999999999" ...
##  $ 2008-11: chr  "NA" "3.488" "345.423" "152.952" ...
##  $ 2008-12: num  1232.65 3.01 325.77 140.81 446.74 ...
##  $ 2009-01: chr  "NA" "3.0139999999999998" "338.53199999999998" "141.44800000000001" ...
##  $ 2009-02: chr  "NA" "3.1960000000000002" "360.41199999999998" "143.529" ...
##  $ 2009-03: num  1209.79 3.33 353.69 142.89 467.22 ...
##  $ 2009-04: chr  "NA" "4.0490000000000004" "359.38" "142.34" ...
##  $ 2009-05: chr  "NA" "4.1189999999999998" "354.75" "144.22499999999999" ...
##  $ 2009-06: num  1233.1 4.9 347.9 142 473.1 ...
##  $ 2009-07: chr  "NA" "6.444" "339.47699999999998" "137.691" ...
##  $ 2009-08: chr  "NA" "5.9029999999999996" "332.661" "139.15799999999999" ...
##  $ 2009-09: num  1230.5 4.7 374.3 139.1 500.4 ...
##  $ 2009-10: chr  "NA" "4.2119999999999997" "385.86799999999999" "137.10400000000001" ...
##  $ 2009-11: chr  "NA" "3.5760000000000001" "366.98" "129.34299999999999" ...
##  $ 2009-12: num  1207.85 3.11 332.39 126.07 440.93 ...
##  $ 2010-01: chr  "NA" "3.2069999999999999" "362.226" "130.91" ...
##  $ 2010-02: chr  "NA" "3.1949999999999998" "361.13799999999998" "131.91800000000001" ...
##  $ 2010-03: num  1208.86 3.48 373.44 131.25 483.4 ...
##  $ 2010-04: chr  "NA" "4.452" "378.61099999999999" "131.72200000000001" ...
##  $ 2010-05: chr  "NA" "4.415" "380.17099999999999" "128.80000000000001" ...
##  $ 2010-06: num  1244.41 5.41 363.27 129.14 490.26 ...
##  $ 2010-07: chr  "NA" "6.5129999999999999" "353.04" "122.935" ...
##  $ 2010-08: chr  "NA" "6.2690000000000001" "343.68799999999999" "129.732" ...
##  $ 2010-09: num  1225.5 4.7 381.6 132.9 521.1 ...
##  $ 2010-10: chr  "NA" "4.4020000000000001" "384.98700000000002" "131.03299999999999" ...
##  $ 2010-11: chr  "NA" "3.7309999999999999" "367.95499999999998" "130.88900000000001" ...
##  $ 2010-12: num  1216.26 3.16 326.34 121.42 450.43 ...
##  $ 2011-01: chr  "NA" "3.14" "334.95800000000003" "128.39599999999999" ...
##  $ 2011-02: chr  "NA" "3.2839999999999998" "346.23399999999998" "125.46299999999999" ...
##  $ 2011-03: num  1223.45 3.67 380.4 134.37 516.73 ...
##  $ 2011-04: chr  "NA" "4.2510000000000003" "380.44600000000003" "134.16900000000001" ...
##  $ 2011-05: chr  "NA" "4.431" "385.28899999999999" "136.13999999999999" ...
##  $ 2011-06: num  1302.41 5.47 376.32 135.58 529.53 ...
##  $ 2011-07: chr  "NA" "6.5810000000000004" "361.58499999999998" "132.41" ...
##  $ 2011-08: chr  "NA" "6.7329999999999997" "353.79300000000001" "130.61600000000001" ...
##  $ 2011-09: num  1291 5 388 137 550 ...
##  $ 2011-10: chr  "NA" "4.484" "398.45600000000002" "128.72" ...
head(mbta)
## # A tibble: 6 x 60
##    X__1 mode   `2007-01` `2007-02` `2007-03` `2007-04` `2007-05` `2007-06`
##   <dbl> <chr>  <chr>     <chr>         <dbl> <chr>     <chr>         <dbl>
## 1    1. All M~ NA        NA           1188.  NA        NA          1246.  
## 2    2. Boat   4         3.6            40.0 4.3       4.900000~      5.80
## 3    3. Bus    335.8190~ 338.6750~     340.  352.1619~ 354.3670~    351.  
## 4    4. Commu~ 142.1999~ 138.5         138.  139.5     139          143.  
## 5    5. Heavy~ 435.2939~ 448.2710~     459.  472.2010~ 474.5790~    477.  
## 6    6. Light~ 227.2309~ 240.262       241.  255.5569~ 248.262      246.  
## # ... with 52 more variables: `2007-07` <chr>, `2007-08` <chr>,
## #   `2007-09` <dbl>, `2007-10` <chr>, `2007-11` <chr>, `2007-12` <dbl>,
## #   `2008-01` <chr>, `2008-02` <chr>, `2008-03` <dbl>, `2008-04` <chr>,
## #   `2008-05` <chr>, `2008-06` <dbl>, `2008-07` <chr>, `2008-08` <chr>,
## #   `2008-09` <dbl>, `2008-10` <chr>, `2008-11` <chr>, `2008-12` <dbl>,
## #   `2009-01` <chr>, `2009-02` <chr>, `2009-03` <dbl>, `2009-04` <chr>,
## #   `2009-05` <chr>, `2009-06` <dbl>, `2009-07` <chr>, `2009-08` <chr>,
## #   `2009-09` <dbl>, `2009-10` <chr>, `2009-11` <chr>, `2009-12` <dbl>,
## #   `2010-01` <chr>, `2010-02` <chr>, `2010-03` <dbl>, `2010-04` <chr>,
## #   `2010-05` <chr>, `2010-06` <dbl>, `2010-07` <chr>, `2010-08` <chr>,
## #   `2010-09` <dbl>, `2010-10` <chr>, `2010-11` <chr>, `2010-12` <dbl>,
## #   `2011-01` <chr>, `2011-02` <chr>, `2011-03` <dbl>, `2011-04` <chr>,
## #   `2011-05` <chr>, `2011-06` <dbl>, `2011-07` <chr>, `2011-08` <chr>,
## #   `2011-09` <dbl>, `2011-10` <chr>
summary(mbta)
##       X__1          mode             2007-01            2007-02         
##  Min.   : 1.0   Length:11          Length:11          Length:11         
##  1st Qu.: 3.5   Class :character   Class :character   Class :character  
##  Median : 6.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 6.0                                                           
##  3rd Qu.: 8.5                                                           
##  Max.   :11.0                                                           
##     2007-03           2007-04            2007-05         
##  Min.   :   0.114   Length:11          Length:11         
##  1st Qu.:   9.278   Class :character   Class :character  
##  Median : 137.700   Mode  :character   Mode  :character  
##  Mean   : 330.293                                        
##  3rd Qu.: 399.225                                        
##  Max.   :1204.725                                        
##     2007-06           2007-07            2007-08         
##  Min.   :   0.096   Length:11          Length:11         
##  1st Qu.:   5.700   Class :character   Class :character  
##  Median : 143.000   Mode  :character   Mode  :character  
##  Mean   : 339.846                                        
##  3rd Qu.: 413.788                                        
##  Max.   :1246.129                                        
##     2007-09           2007-10            2007-11         
##  Min.   :  -0.007   Length:11          Length:11         
##  1st Qu.:   5.539   Class :character   Class :character  
##  Median : 143.051   Mode  :character   Mode  :character  
##  Mean   : 352.554                                        
##  3rd Qu.: 436.082                                        
##  Max.   :1310.764                                        
##     2007-12           2008-01            2008-02         
##  Min.   :  -0.060   Length:11          Length:11         
##  1st Qu.:   4.385   Class :character   Class :character  
##  Median : 141.585   Mode  :character   Mode  :character  
##  Mean   : 321.588                                        
##  3rd Qu.: 380.594                                        
##  Max.   :1216.890                                        
##     2008-03           2008-04            2008-05         
##  Min.   :   0.058   Length:11          Length:11         
##  1st Qu.:   5.170   Class :character   Class :character  
##  Median : 137.453   Mode  :character   Mode  :character  
##  Mean   : 345.604                                        
##  3rd Qu.: 427.601                                        
##  Max.   :1274.031                                        
##     2008-06           2008-07            2008-08         
##  Min.   :   0.060   Length:11          Length:11         
##  1st Qu.:   5.742   Class :character   Class :character  
##  Median : 142.057   Mode  :character   Mode  :character  
##  Mean   : 359.667                                        
##  3rd Qu.: 440.656                                        
##  Max.   :1320.728                                        
##     2008-09           2008-10            2008-11         
##  Min.   :   0.021   Length:11          Length:11         
##  1st Qu.:   5.691   Class :character   Class :character  
##  Median : 141.907   Mode  :character   Mode  :character  
##  Mean   : 362.099                                        
##  3rd Qu.: 453.430                                        
##  Max.   :1338.015                                        
##     2008-12           2009-01            2009-02         
##  Min.   :  -0.015   Length:11          Length:11         
##  1st Qu.:   4.689   Class :character   Class :character  
##  Median : 140.810   Mode  :character   Mode  :character  
##  Mean   : 319.882                                        
##  3rd Qu.: 386.255                                        
##  Max.   :1232.655                                        
##     2009-03           2009-04            2009-05         
##  Min.   :  -0.050   Length:11          Length:11         
##  1st Qu.:   5.003   Class :character   Class :character  
##  Median : 142.893   Mode  :character   Mode  :character  
##  Mean   : 330.142                                        
##  3rd Qu.: 410.455                                        
##  Max.   :1210.912                                        
##     2009-06           2009-07            2009-08         
##  Min.   :  -0.079   Length:11          Length:11         
##  1st Qu.:   5.845   Class :character   Class :character  
##  Median : 142.006   Mode  :character   Mode  :character  
##  Mean   : 333.194                                        
##  3rd Qu.: 410.482                                        
##  Max.   :1233.085                                        
##     2009-09           2009-10            2009-11         
##  Min.   :  -0.035   Length:11          Length:11         
##  1st Qu.:   5.693   Class :character   Class :character  
##  Median : 139.087   Mode  :character   Mode  :character  
##  Mean   : 346.687                                        
##  3rd Qu.: 437.332                                        
##  Max.   :1291.564                                        
##     2009-12           2010-01            2010-02         
##  Min.   :  -0.022   Length:11          Length:11         
##  1st Qu.:   4.784   Class :character   Class :character  
##  Median : 126.066   Mode  :character   Mode  :character  
##  Mean   : 312.962                                        
##  3rd Qu.: 386.659                                        
##  Max.   :1207.845                                        
##     2010-03           2010-04            2010-05         
##  Min.   :   0.012   Length:11          Length:11         
##  1st Qu.:   5.274   Class :character   Class :character  
##  Median : 131.252   Mode  :character   Mode  :character  
##  Mean   : 332.726                                        
##  3rd Qu.: 428.420                                        
##  Max.   :1225.556                                        
##     2010-06           2010-07            2010-08         
##  Min.   :   0.008   Length:11          Length:11         
##  1st Qu.:   6.436   Class :character   Class :character  
##  Median : 129.144   Mode  :character   Mode  :character  
##  Mean   : 335.964                                        
##  3rd Qu.: 426.769                                        
##  Max.   :1244.409                                        
##     2010-09           2010-10            2010-11         
##  Min.   :   0.001   Length:11          Length:11         
##  1st Qu.:   5.567   Class :character   Class :character  
##  Median : 132.892   Mode  :character   Mode  :character  
##  Mean   : 346.524                                        
##  3rd Qu.: 451.361                                        
##  Max.   :1293.117                                        
##     2010-12           2011-01            2011-02         
##  Min.   :  -0.004   Length:11          Length:11         
##  1st Qu.:   4.466   Class :character   Class :character  
##  Median : 121.422   Mode  :character   Mode  :character  
##  Mean   : 312.917                                        
##  3rd Qu.: 388.385                                        
##  Max.   :1216.262                                        
##     2011-03          2011-04            2011-05         
##  Min.   :   0.05   Length:11          Length:11         
##  1st Qu.:   6.03   Class :character   Class :character  
##  Median : 134.37   Mode  :character   Mode  :character  
##  Mean   : 345.17                                        
##  3rd Qu.: 448.56                                        
##  Max.   :1286.66                                        
##     2011-06           2011-07            2011-08         
##  Min.   :   0.054   Length:11          Length:11         
##  1st Qu.:   6.926   Class :character   Class :character  
##  Median : 135.581   Mode  :character   Mode  :character  
##  Mean   : 353.331                                        
##  3rd Qu.: 452.923                                        
##  Max.   :1302.414                                        
##     2011-09           2011-10         
##  Min.   :   0.043   Length:11         
##  1st Qu.:   6.660   Class :character  
##  Median : 136.901   Mode  :character  
##  Mean   : 362.555                     
##  3rd Qu.: 469.204                     
##  Max.   :1348.754
# Conclusion: observations stored as columns rather than as rows.
# Need to remove rows 1, 7, and 11 (All Modes By Qtr, Pct Chg / Yr, and TOTAL). 
# Need to remove column 1 (row number)
# Gather the columns (yyyy-dd) into key-value pairs.
# Spread the modes into columns.
mbta2 <- mbta[-c(1,7,11),]
mbta3 <- mbta2[,-1]
library(tidyr)
mbta4 <- gather(mbta3, month, thou_riders, -c(mode))
mbta4$thou_riders <- as.numeric(mbta4$thou_riders)
mbta5 <- spread(mbta4, mode, thou_riders)
mbta6 <- separate(mbta5, col = "month", into = c("year", "month"), sep = "-")
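
gather() and spread() still work, but tidyr 1.0+ favors pivot_longer() and pivot_wider(); a rough equivalent of the two reshaping steps above, assuming a recent tidyr is installed:

# Long format: one row per mode-month combination.
mbta4b <- pivot_longer(mbta3, cols = -mode, names_to = "month", values_to = "thou_riders")
mbta4b$thou_riders <- as.numeric(mbta4b$thou_riders)
# Wide format: one column per mode.
mbta5b <- pivot_wider(mbta4b, names_from = mode, values_from = thou_riders)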

# Screen for obvious mistakes and/or outliers.
summary(mbta6)
##      year              month                Boat             Bus       
##  Length:58          Length:58          Min.   : 2.985   Min.   :312.9  
##  Class :character   Class :character   1st Qu.: 3.494   1st Qu.:345.6  
##  Mode  :character   Mode  :character   Median : 4.293   Median :359.9  
##                                        Mean   : 5.068   Mean   :358.6  
##                                        3rd Qu.: 5.356   3rd Qu.:372.2  
##                                        Max.   :40.000   Max.   :398.5  
##  Commuter Rail     Heavy Rail      Light Rail     Private Bus   
##  Min.   :121.4   Min.   :435.3   Min.   :194.4   Min.   :2.213  
##  1st Qu.:131.4   1st Qu.:471.1   1st Qu.:220.6   1st Qu.:2.641  
##  Median :138.8   Median :487.3   Median :231.9   Median :2.820  
##  Mean   :137.4   Mean   :489.3   Mean   :233.0   Mean   :3.352  
##  3rd Qu.:142.4   3rd Qu.:511.3   3rd Qu.:244.5   3rd Qu.:4.167  
##  Max.   :153.0   Max.   :554.9   Max.   :271.1   Max.   :4.878  
##       RIDE       Trackless Trolley
##  Min.   :4.900   Min.   : 5.777   
##  1st Qu.:5.965   1st Qu.:11.679   
##  Median :6.615   Median :12.598   
##  Mean   :6.604   Mean   :12.125   
##  3rd Qu.:7.149   3rd Qu.:13.320   
##  Max.   :8.598   Max.   :15.109
# The Boat column looks suspicious: the 40 should be 4.
hist(mbta6$Boat)

i <- which(mbta6$Boat == 40)
mbta6$Boat[i] <- 4
hist(mbta6$Boat)

# ggplot2 is needed for the plots below.
library(ggplot2)
mbta_boat <- mbta4 %>% filter(mode == "Boat" | mode == "Trackless Trolley")
# Look at Boat and Trackless Trolley ridership over time.
ggplot(mbta_boat, aes(x = month, y = thou_riders, col = mode)) + geom_point() +
  scale_x_discrete(name = "Month", breaks = c(200701, 200801, 200901, 201001, 201101)) +
  scale_y_continuous(name = "Avg Weekday Ridership (thousands)")

# Look at all T ridership over time.
ggplot(mbta4, aes(x = month, y = thou_riders, col = mode)) + geom_point() + 
  scale_x_discrete(name = "Month", breaks = c(200701, 200801, 200901, 201001, 201101)) +  
  scale_y_continuous(name = "Avg Weekday Ridership (thousands)")

Example 3

# Read data.  For large data sets, use fread() from the data.table package.
library(data.table)
food_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1294/datasets/food.csv"
food <- fread(food_url, data.table = FALSE)

# Examine organization.
str(food)
## 'data.frame':    1500 obs. of  160 variables:
##  $ V1                                        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ code                                      : int  100030 100050 100079 100094 100124 100136 100194 100221 100257 100258 ...
##  $ url                                       : chr  "http://world-en.openfoodfacts.org/product/3222475745867/confiture-de-fraise-fraise-des-bois-au-sucre-de-canne-casino-delices" "http://world-en.openfoodfacts.org/product/5410976880110/guylian-sea-shells-selection" "http://world-en.openfoodfacts.org/product/3264750423503/pates-de-fruits-aromatisees-jacquot" "http://world-en.openfoodfacts.org/product/8006040247001/nata-vegetal-a-base-de-soja-valsoia" ...
##  $ creator                                   : chr  "sebleouf" "foodorigins" "domdom26" "javichu" ...
##  $ created_t                                 : int  1424747544 1450316429 1428674916 1420416591 1420501121 1437983923 1442420988 1435686217 1436991777 1400516512 ...
##  $ created_datetime                          : chr  "2015-02-24T03:12:24Z" "2015-12-17T01:40:29Z" "2015-04-10T14:08:36Z" "2015-01-05T00:09:51Z" ...
##  $ last_modified_t                           : int  1438445887 1450817956 1428739289 1420417876 1445700917 1445577476 1442420988 1451405288 1436991779 1437236856 ...
##  $ last_modified_datetime                    : chr  "2015-08-01T16:18:07Z" "2015-12-22T20:59:16Z" "2015-04-11T08:01:29Z" "2015-01-05T00:31:16Z" ...
##  $ product_name                              : chr  "Confiture de fraise fraise des bois au sucre de canne" "Guylian Sea Shells Selection" "Pâtes de fruits aromatisées" "Nata vegetal a base de soja &quot;Valsoia&quot;" ...
##  $ generic_name                              : chr  "" "" "Pâtes de fruits" "Nata vegetal a base de soja" ...
##  $ quantity                                  : chr  "265 g" "375g" "1 kg" "200 ml" ...
##  $ packaging                                 : chr  "Bocal,Verre" "Plastic,Box" "Carton,plastique" "Tetra Brik" ...
##  $ packaging_tags                            : chr  "bocal,verre" "plastic,box" "carton,plastique" "tetra-brik" ...
##  $ brands                                    : chr  "Casino Délices" "Guylian" "Jacquot" "Valsoia,//Propiedad de://,Valsoia S.p.A." ...
##  $ brands_tags                               : chr  "casino-delices" "guylian" "jacquot" "valsoia,propiedad-de,valsoia-s-p-a" ...
##  $ categories                                : chr  "Aliments et boissons à base de végétaux,Aliments d'origine végétale,Aliments à base de fruits et de légu"| __truncated__ "Chocolate" "pâtes de fruits" "Alimentos y bebidas de origen vegetal,Alimentos de origen vegetal,Natas vegetales,Natas vegetales a base de soj"| __truncated__ ...
##  $ categories_tags                           : chr  "en:plant-based-foods-and-beverages,en:plant-based-foods,en:fruits-and-vegetables-based-foods,en:breakfasts,en:s"| __truncated__ "en:sugary-snacks,en:chocolates" "en:plant-based-foods-and-beverages,en:plant-based-foods,en:fruits-and-vegetables-based-foods,en:sugary-snacks,e"| __truncated__ "en:plant-based-foods-and-beverages,en:plant-based-foods,en:plant-based-creams,en:plant-based-creams-for-cooking"| __truncated__ ...
##  $ categories_en                             : chr  "Plant-based foods and beverages,Plant-based foods,Fruits and vegetables based foods,Breakfasts,Spreads,Fruits b"| __truncated__ "Sugary snacks,Chocolates" "Plant-based foods and beverages,Plant-based foods,Fruits and vegetables based foods,Sugary snacks,Confectioneri"| __truncated__ "Plant-based foods and beverages,Plant-based foods,Plant-based creams,Plant-based creams for cooking,Soy-based c"| __truncated__ ...
##  $ origins                                   : chr  "" "" "" "" ...
##  $ origins_tags                              : chr  "" "" "" "" ...
##  $ manufacturing_places                      : chr  "France" "Belgium" "" "Italia" ...
##  $ manufacturing_places_tags                 : chr  "france" "belgium" "" "italia" ...
##  $ labels                                    : chr  "" "" "" "Vegetariano,Vegano,Sin gluten,Sin OMG,Sin lactosa" ...
##  $ labels_tags                               : chr  "" "" "" "en:vegetarian,en:vegan,en:gluten-free,en:no-gmos,en:no-lactose" ...
##  $ labels_en                                 : chr  "" "" "" "Vegetarian,Vegan,Gluten-free,No GMOs,No lactose" ...
##  $ emb_codes                                 : chr  "EMB 78015" "" "" "" ...
##  $ emb_codes_tags                            : chr  "emb-78015" "" "" "" ...
##  $ first_packaging_code_geo                  : chr  "48.983333,2.066667" "" "" "" ...
##  $ cities                                    : logi  NA NA NA NA NA NA ...
##  $ cities_tags                               : chr  "andresy-yvelines-france" "" "" "" ...
##  $ purchase_places                           : chr  "Lyon,France" "NSW,Australia" "France" "Madrid,España" ...
##  $ stores                                    : chr  "Casino" "" "" "El Corte Inglés" ...
##  $ countries                                 : chr  "France" "Australia" "France" "España" ...
##  $ countries_tags                            : chr  "en:france" "en:australia" "en:france" "en:spain" ...
##  $ countries_en                              : chr  "France" "Australia" "France" "Spain" ...
##  $ ingredients_text                          : chr  "Sucre de canne, fraises 40 g, fraises des bois 14 g, gélifiant : pectines de fruits, jus de citron concentré."| __truncated__ "" "Pulpe de pommes 50% , sucre, sirop de glucose, gélifiant : pectine, acidifiant : acide citrique, arômes, colo"| __truncated__ "Extracto de soja (78%) (agua, semillas de soja 8,3%), grasas vegetales, jarabe de glucosa, dextrosa, emulsionan"| __truncated__ ...
##  $ allergens                                 : chr  "" "" "" "" ...
##  $ allergens_en                              : logi  NA NA NA NA NA NA ...
##  $ traces                                    : chr  "Lait,Fruits à coque" "" "" "" ...
##  $ traces_tags                               : chr  "en:milk,en:nuts" "" "" "" ...
##  $ traces_en                                 : chr  "Milk,Nuts" "" "" "" ...
##  $ serving_size                              : chr  "15 g" "" "" "" ...
##  $ no_nutriments                             : logi  NA NA NA NA NA NA ...
##  $ additives_n                               : int  1 NA 2 5 0 NA NA 0 NA 1 ...
##  $ additives                                 : chr  "[ sucre-de-canne -> fr:sucre-de-canne  ]  [ sucre-de -> fr:sucre-de  ]  [ sucre -> fr:sucre  ]  [ fraises-40-g "| __truncated__ "" "[ pulpe-de-pommes-50 -> fr:pulpe-de-pommes-50  ]  [ pulpe-de-pommes -> fr:pulpe-de-pommes  ]  [ pulpe-de -> fr:"| __truncated__ "[ extracto-de-soja -> es:extracto-de-soja  ]  [ 78 -> es:78  ]  [ agua -> es:agua  ]  [ semillas-de-soja-8 -> e"| __truncated__ ...
##  $ additives_tags                            : chr  "en:e440" "" "en:e440,en:e330" "en:e471,en:e415,en:e407,en:e412,en:e306" ...
##  $ additives_en                              : chr  "E440 - Pectins" "" "E440 - Pectins,E330 - Citric acid" "E471 - Mono- and diglycerides of fatty acids,E415 - Xanthan gum,E407 - Carrageenan,E412 - Guar gum,E306 - Tocop"| __truncated__ ...
##  $ ingredients_from_palm_oil_n               : int  0 NA 0 0 0 NA NA 0 NA 0 ...
##  $ ingredients_from_palm_oil                 : logi  NA NA NA NA NA NA ...
##  $ ingredients_from_palm_oil_tags            : chr  "" "" "" "" ...
##  $ ingredients_that_may_be_from_palm_oil_n   : int  0 NA 0 1 0 NA NA 0 NA 0 ...
##  $ ingredients_that_may_be_from_palm_oil     : logi  NA NA NA NA NA NA ...
##  $ ingredients_that_may_be_from_palm_oil_tags: chr  "" "" "" "e471-mono-et-diglycerides-d-acides-gras-alimentaires" ...
##  $ nutrition_grade_uk                        : logi  NA NA NA NA NA NA ...
##  $ nutrition_grade_fr                        : chr  "d" "" "" "d" ...
##  $ pnns_groups_1                             : chr  "Sugary snacks" "Sugary snacks" "Fruits and vegetables" "unknown" ...
##  $ pnns_groups_2                             : chr  "Sweets" "Chocolate products" "Fruits" "unknown" ...
##  $ states                                    : chr  "en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-to-be"| __truncated__ "en:to-be-completed, en:nutrition-facts-to-be-completed, en:ingredients-to-be-completed, en:expiration-date-to-b"| __truncated__ "en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-to-be"| __truncated__ "en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-compl"| __truncated__ ...
##  $ states_tags                               : chr  "en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-to-be-com"| __truncated__ "en:to-be-completed,en:nutrition-facts-to-be-completed,en:ingredients-to-be-completed,en:expiration-date-to-be-c"| __truncated__ "en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-to-be-com"| __truncated__ "en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-completed"| __truncated__ ...
##  $ states_en                                 : chr  "To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date to be completed,Characte"| __truncated__ "To be completed,Nutrition facts to be completed,Ingredients to be completed,Expiration date to be completed,Cha"| __truncated__ "To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date to be completed,Characte"| __truncated__ "To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date completed,Characteristic"| __truncated__ ...
##  $ main_category                             : chr  "en:plant-based-foods-and-beverages" "en:sugary-snacks" "en:plant-based-foods-and-beverages" "en:plant-based-foods-and-beverages" ...
##  $ main_category_en                          : chr  "Plant-based foods and beverages" "Sugary snacks" "Plant-based foods and beverages" "Plant-based foods and beverages" ...
##  $ image_url                                 : chr  "http://en.openfoodfacts.org/images/products/322/247/574/5867/front.8.400.jpg" "http://en.openfoodfacts.org/images/products/541/097/688/0110/front.7.400.jpg" "http://en.openfoodfacts.org/images/products/326/475/042/3503/front.6.400.jpg" "http://en.openfoodfacts.org/images/products/800/604/024/7001/front.7.400.jpg" ...
##  $ image_small_url                           : chr  "http://en.openfoodfacts.org/images/products/322/247/574/5867/front.8.200.jpg" "http://en.openfoodfacts.org/images/products/541/097/688/0110/front.7.200.jpg" "http://en.openfoodfacts.org/images/products/326/475/042/3503/front.6.200.jpg" "http://en.openfoodfacts.org/images/products/800/604/024/7001/front.7.200.jpg" ...
##  $ energy_100g                               : num  918 NA NA 766 2359 ...
##  $ energy_from_fat_100g                      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ fat_100g                                  : num  0 NA NA 16.7 45.5 NA NA 25 NA 4 ...
##  $ saturated_fat_100g                        : num  0 NA NA 9.9 5.2 NA NA 17 NA 0.54 ...
##  $ butyric_acid_100g                         : logi  NA NA NA NA NA NA ...
##  $ caproic_acid_100g                         : logi  NA NA NA NA NA NA ...
##  $ caprylic_acid_100g                        : logi  NA NA NA NA NA NA ...
##  $ capric_acid_100g                          : logi  NA NA NA NA NA NA ...
##  $ lauric_acid_100g                          : logi  NA NA NA NA NA NA ...
##  $ myristic_acid_100g                        : logi  NA NA NA NA NA NA ...
##  $ palmitic_acid_100g                        : logi  NA NA NA NA NA NA ...
##  $ stearic_acid_100g                         : logi  NA NA NA NA NA NA ...
##  $ arachidic_acid_100g                       : logi  NA NA NA NA NA NA ...
##  $ behenic_acid_100g                         : logi  NA NA NA NA NA NA ...
##  $ lignoceric_acid_100g                      : logi  NA NA NA NA NA NA ...
##  $ cerotic_acid_100g                         : logi  NA NA NA NA NA NA ...
##  $ montanic_acid_100g                        : logi  NA NA NA NA NA NA ...
##  $ melissic_acid_100g                        : logi  NA NA NA NA NA NA ...
##  $ monounsaturated_fat_100g                  : num  NA NA NA 2.9 9.5 NA NA NA NA NA ...
##  $ polyunsaturated_fat_100g                  : num  NA NA NA 3.9 32.8 NA NA NA NA NA ...
##  $ omega_3_fat_100g                          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ alpha_linolenic_acid_100g                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ eicosapentaenoic_acid_100g                : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ docosahexaenoic_acid_100g                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ omega_6_fat_100g                          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ linoleic_acid_100g                        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ arachidonic_acid_100g                     : logi  NA NA NA NA NA NA ...
##  $ gamma_linolenic_acid_100g                 : logi  NA NA NA NA NA NA ...
##  $ dihomo_gamma_linolenic_acid_100g          : logi  NA NA NA NA NA NA ...
##  $ omega_9_fat_100g                          : logi  NA NA NA NA NA NA ...
##  $ oleic_acid_100g                           : logi  NA NA NA NA NA NA ...
##  $ elaidic_acid_100g                         : logi  NA NA NA NA NA NA ...
##  $ gondoic_acid_100g                         : logi  NA NA NA NA NA NA ...
##  $ mead_acid_100g                            : logi  NA NA NA NA NA NA ...
##  $ erucic_acid_100g                          : logi  NA NA NA NA NA NA ...
##   [list output truncated]
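The str() listing above is truncated because the data frame has well over a hundred columns, and the full head(food) printout below is similarly wide. Before printing everything, it can help to inspect the data more selectively. A minimal sketch, assuming the food data frame loaded above; the column names used for subsetting (code, product_name, brands, countries_en, energy_100g) are taken from the str() output, and glimpse() is an optional alternative that requires the dplyr package.

# Limit str() to the first 20 columns instead of the full listing.
str(food, list.len = 20)

# Look at a handful of columns by name; these names appear in the str() output above.
head(food[, c("code", "product_name", "brands", "countries_en", "energy_100g")])

# If dplyr is installed, glimpse() prints one column per line,
# which is easier to scan for very wide data frames.
# library(dplyr)
# glimpse(food)

For comparison, the full head(food) output follows.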
head(food)
##   V1   code
## 1  1 100030
## 2  2 100050
## 3  3 100079
## 4  4 100094
## 5  5 100124
## 6  6 100136
##                                                                                                                            url
## 1 http://world-en.openfoodfacts.org/product/3222475745867/confiture-de-fraise-fraise-des-bois-au-sucre-de-canne-casino-delices
## 2                                         http://world-en.openfoodfacts.org/product/5410976880110/guylian-sea-shells-selection
## 3                                  http://world-en.openfoodfacts.org/product/3264750423503/pates-de-fruits-aromatisees-jacquot
## 4                                  http://world-en.openfoodfacts.org/product/8006040247001/nata-vegetal-a-base-de-soja-valsoia
## 5           http://world-en.openfoodfacts.org/product/8480000340764/semillas-de-girasol-con-cascara-tostadas-aguasal-hacendado
## 6                                                           http://world-en.openfoodfacts.org/product/0087703177727/soft-drink
##       creator  created_t     created_datetime last_modified_t
## 1    sebleouf 1424747544 2015-02-24T03:12:24Z      1438445887
## 2 foodorigins 1450316429 2015-12-17T01:40:29Z      1450817956
## 3    domdom26 1428674916 2015-04-10T14:08:36Z      1428739289
## 4     javichu 1420416591 2015-01-05T00:09:51Z      1420417876
## 5     javichu 1420501121 2015-01-05T23:38:41Z      1445700917
## 6 foodorigins 1437983923 2015-07-27T07:58:43Z      1445577476
##   last_modified_datetime
## 1   2015-08-01T16:18:07Z
## 2   2015-12-22T20:59:16Z
## 3   2015-04-11T08:01:29Z
## 4   2015-01-05T00:31:16Z
## 5   2015-10-24T15:35:17Z
## 6   2015-10-23T05:17:56Z
##                                            product_name
## 1 Confiture de fraise fraise des bois au sucre de canne
## 2                          Guylian Sea Shells Selection
## 3                         Pâtes de fruits aromatisées
## 4       Nata vegetal a base de soja &quot;Valsoia&quot;
## 5     Semillas de girasol con cáscara tostadas aguasal
## 6                                            Soft Drink
##                                        generic_name quantity
## 1                                                      265 g
## 2                                                       375g
## 3                                  Pâtes de fruits     1 kg
## 4                       Nata vegetal a base de soja   200 ml
## 5 Semillas de girasol con cáscara tostadas aguasal    200 g
## 6                                                           
##                                              packaging
## 1                                          Bocal,Verre
## 2                                          Plastic,Box
## 3                                     Carton,plastique
## 4                                           Tetra Brik
## 5 Bolsa de plástico,Envasado en atmósfera protectora
## 6                                                     
##                                       packaging_tags
## 1                                        bocal,verre
## 2                                        plastic,box
## 3                                   carton,plastique
## 4                                         tetra-brik
## 5 bolsa-de-plastico,envasado-en-atmosfera-protectora
## 6                                                   
##                                       brands
## 1                            Casino Délices
## 2                                    Guylian
## 3                                    Jacquot
## 4   Valsoia,//Propiedad de://,Valsoia S.p.A.
## 5 Hacendado,//Propiedad de://,Mercadona S.A.
## 6                                           
##                            brands_tags
## 1                       casino-delices
## 2                              guylian
## 3                              jacquot
## 4   valsoia,propiedad-de,valsoia-s-p-a
## 5 hacendado,propiedad-de,mercadona-s-a
## 6                                     
##                                                                                                                                                                                                                                                                                                                                                    categories
## 1 Aliments et boissons à base de végétaux,Aliments d'origine végétale,Aliments à base de fruits et de légumes,Petit-déjeuners,Produits à tartiner,Fruits et produits dérivés,Pâtes à tartiner végétaux,Produits à tartiner sucrés,Confitures et marmelades,Confitures,Confitures de fruits,Confitures de fruits rouges,Confitures de fraises
## 2                                                                                                                                                                                                                                                                                                                                                   Chocolate
## 3                                                                                                                                                                                                                                                                                                                                            pâtes de fruits
## 4                                                                                                                                                                                                  Alimentos y bebidas de origen vegetal,Alimentos de origen vegetal,Natas vegetales,Natas vegetales a base de soja para cocinar,Natas vegetales para cocinar
## 5                                                                                                                                Semillas de girasol y derivados, Semillas, Semillas de girasol, Semillas de girasol con cáscara, Semillas de girasol tostadas, Semillas de girasol con cáscara tostadas, Semillas de girasol con cáscara tostadas aguasal
## 6                                                                                                                                                                                                                                                                                                                                                            
##                                                                                                                                                                                                                                                              categories_tags
## 1              en:plant-based-foods-and-beverages,en:plant-based-foods,en:fruits-and-vegetables-based-foods,en:breakfasts,en:spreads,en:fruits-based-foods,en:plant-based-spreads,en:sweet-spreads,en:fruit-preserves,en:jams,en:fruit-jams,en:berry-jams,en:strawberry-jams
## 2                                                                                                                                                                                                                                             en:sugary-snacks,en:chocolates
## 3                                                                                                     en:plant-based-foods-and-beverages,en:plant-based-foods,en:fruits-and-vegetables-based-foods,en:sugary-snacks,en:confectioneries,en:fruits-based-foods,en:fruit-pastes
## 4                                                                                                                            en:plant-based-foods-and-beverages,en:plant-based-foods,en:plant-based-creams,en:plant-based-creams-for-cooking,en:soy-based-creams-for-cooking
## 5 en:plant-based-foods-and-beverages,en:plant-based-foods,en:seeds,en:sunflower-seeds-and-their-products,en:sunflower-seeds,en:roasted-sunflower-seeds,en:unshelled-sunflower-seeds,en:roasted-unshelled-sunflower-seeds,es:semillas-de-girasol-con-cascara-tostadas-aguasal
## 6                                                                                                                                                                                                                                                                           
##                                                                                                                                                                                                                                        categories_en
## 1                             Plant-based foods and beverages,Plant-based foods,Fruits and vegetables based foods,Breakfasts,Spreads,Fruits based foods,Plant-based spreads,Sweet spreads,Fruit preserves,Jams,Fruit jams,Berry jams,Strawberry jams
## 2                                                                                                                                                                                                                           Sugary snacks,Chocolates
## 3                                                                                                  Plant-based foods and beverages,Plant-based foods,Fruits and vegetables based foods,Sugary snacks,Confectioneries,Fruits based foods,Fruit pastes
## 4                                                                                                                   Plant-based foods and beverages,Plant-based foods,Plant-based creams,Plant-based creams for cooking,Soy-based creams for cooking
## 5 Plant-based foods and beverages,Plant-based foods,Seeds,Sunflower seeds and their products,Sunflower seeds,Roasted sunflower seeds,Unshelled sunflower seeds,Roasted unshelled sunflower seeds,es:Semillas-de-girasol-con-cascara-tostadas-aguasal
## 6                                                                                                                                                                                                                                                   
##       origins origins_tags
## 1                         
## 2                         
## 3                         
## 4                         
## 5   Argentina    argentina
## 6 South Korea  south-korea
##                                            manufacturing_places
## 1                                                        France
## 2                                                       Belgium
## 3                                                              
## 4                                                        Italia
## 5 Beniparrell,Valencia (provincia),Comunidad Valenciana,España
## 6                                                   South Korea
##                                    manufacturing_places_tags
## 1                                                     france
## 2                                                    belgium
## 3                                                           
## 4                                                     italia
## 5 beniparrell,valencia-provincia,comunidad-valenciana,espana
## 6                                                south-korea
##                                              labels
## 1                                                  
## 2                                                  
## 3                                                  
## 4 Vegetariano,Vegano,Sin gluten,Sin OMG,Sin lactosa
## 5                     Vegetariano,Vegano,Sin gluten
## 6                                                  
##                                                      labels_tags
## 1                                                               
## 2                                                               
## 3                                                               
## 4 en:vegetarian,en:vegan,en:gluten-free,en:no-gmos,en:no-lactose
## 5                          en:vegetarian,en:vegan,en:gluten-free
## 6                                                               
##                                         labels_en
## 1                                                
## 2                                                
## 3                                                
## 4 Vegetarian,Vegan,Gluten-free,No GMOs,No lactose
## 5                    Vegetarian,Vegan,Gluten-free
## 6                                                
##                                     emb_codes
## 1                                   EMB 78015
## 2                                            
## 3                                            
## 4                                            
## 5 ES 21.016540/V EC,ENVASADOR:,IMPORTACO S.A.
## 6                                            
##                              emb_codes_tags first_packaging_code_geo
## 1                                 emb-78015       48.983333,2.066667
## 2                                                                   
## 3                                                                   
## 4                                                                   
## 5 es-21-016540-v-ec,envasador,importaco-s-a                         
## 6                                                                   
##   cities             cities_tags purchase_places           stores
## 1     NA andresy-yvelines-france     Lyon,France           Casino
## 2     NA                           NSW,Australia                 
## 3     NA                                  France                 
## 4     NA                          Madrid,España El Corte Inglés
## 5     NA                          Madrid,España        Mercadona
## 6     NA                                                         
##   countries countries_tags countries_en
## 1    France      en:france       France
## 2 Australia   en:australia    Australia
## 3    France      en:france       France
## 4   España       en:spain        Spain
## 5   España       en:spain        Spain
## 6 Australia   en:australia    Australia
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ingredients_text
## 1                                                                                                                                                                                                                                                                                                                                                                     Sucre de canne, fraises 40 g, fraises des bois 14 g, gélifiant : pectines de fruits, jus de citron concentré. Préparée avec 54 g de fruits pour 100 g de produit fini.
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## 3                                                                                                                                                                                                                                                                                                                        Pulpe de pommes 50% , sucre, sirop de glucose, gélifiant : pectine, acidifiant : acide citrique, arômes, colorants naturels : extrait de paprika — complexes cuivre—chlorophyllines — curcumine — antnocyanes
## 4 Extracto de soja (78%) (agua, semillas de soja 8,3%), grasas vegetales, jarabe de glucosa, dextrosa, emulsionante: mono- y diglicéridos de ácidos grasos (E-471), sal marina, estabilizantes: goma xantana (E-415), carragenatos (E-407), goma guar (E-412); aromas, antioxidante: extractos de tocoferoles (de soja) (E-306). (Nota: el envase en italiano del paquete -que puede verse en el enlace-, especifica que el producto es 100% vegetal. Por tanto los mono- y diglicéridos de ácidos grasos (E-471) son de origen no animal). 
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Pipas de girasol y sal.
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
##   allergens allergens_en                        traces        traces_tags
## 1                     NA          Lait,Fruits à coque    en:milk,en:nuts
## 2                     NA                                                 
## 3                     NA                                                 
## 4                     NA                                                 
## 5                     NA Frutos de cáscara,Cacahuetes en:nuts,en:peanuts
## 6                     NA                                                 
##      traces_en serving_size no_nutriments additives_n
## 1    Milk,Nuts         15 g            NA           1
## 2                                      NA          NA
## 3                                      NA           2
## 4                                      NA           5
## 5 Nuts,Peanuts                         NA           0
## 6                                      NA          NA
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           additives
## 1 [ sucre-de-canne -> fr:sucre-de-canne  ]  [ sucre-de -> fr:sucre-de  ]  [ sucre -> fr:sucre  ]  [ fraises-40-g -> fr:fraises-40-g  ]  [ fraises-40 -> fr:fraises-40  ]  [ fraises -> fr:fraises  ]  [ fraises-des-bois-14-g -> fr:fraises-des-bois-14-g  ]  [ fraises-des-bois-14 -> fr:fraises-des-bois-14  ]  [ fraises-des-bois -> fr:fraises-des-bois  ]  [ fraises-des -> fr:fraises-des  ]  [ fraises -> fr:fraises  ]  [ pectines-de-fruits -> fr:pectines-de-fruits  ]  [ pectines-de -> fr:pectines-de  ]  [ pectines -> en:e440  -> exists  ]  [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g-de-produit-fini -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g-de-produit-fini  ]  [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g-de-produit -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g-de-produit  ]  [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g-de -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g-de  ]  [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100-g  ]  [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100 -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour-100  ]  [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits-pour  ]  [ jus-de-citron-concentre-preparee-avec-54-g-de-fruits -> fr:jus-de-citron-concentre-preparee-avec-54-g-de-fruits  ]  [ jus-de-citron-concentre-preparee-avec-54-g-de -> fr:jus-de-citron-concentre-preparee-avec-54-g-de  ]  [ jus-de-citron-concentre-preparee-avec-54-g -> fr:jus-de-citron-concentre-preparee-avec-54-g  ]  [ jus-de-citron-concentre-preparee-avec-54 -> fr:jus-de-citron-concentre-preparee-avec-54  ]  [ jus-de-citron-concentre-preparee-avec -> fr:jus-de-citron-concentre-preparee-avec  ]  [ jus-de-citron-concentre-preparee -> fr:jus-de-citron-concentre-preparee  ]  [ jus-de-citron-concentre -> fr:jus-de-citron-concentre  ]  [ jus-de-citron -> fr:jus-de-citron  ]  [ jus-de -> fr:jus-de  ]  [ jus -> fr:jus  ]
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       [ pulpe-de-pommes-50 -> fr:pulpe-de-pommes-50  ]  [ pulpe-de-pommes -> fr:pulpe-de-pommes  ]  [ pulpe-de -> fr:pulpe-de  ]  [ pulpe -> fr:pulpe  ]  [ sucre -> fr:sucre  ]  [ sirop-de-glucose -> fr:sirop-de-glucose  ]  [ sirop-de -> fr:sirop-de  ]  [ sirop -> fr:sirop  ]  [ pectine -> en:e440  -> exists  ]  [ acide-citrique -> en:e330  -> exists  ]  [ aromes -> fr:aromes  ]  [ naturels -> fr:naturels  ]  [ extrait-de-paprika-complexes-cuivre-chlorophyllines-curcumine-antnocyanes -> fr:extrait-de-paprika-complexes-cuivre-chlorophyllines-curcumine-antnocyanes  ]  [ extrait-de-paprika-complexes-cuivre-chlorophyllines-curcumine -> fr:extrait-de-paprika-complexes-cuivre-chlorophyllines-curcumine  ]  [ extrait-de-paprika-complexes-cuivre-chlorophyllines -> fr:extrait-de-paprika-complexes-cuivre-chlorophyllines  ]  [ extrait-de-paprika-complexes-cuivre -> fr:extrait-de-paprika-complexes-cuivre  ]  [ extrait-de-paprika-complexes -> fr:extrait-de-paprika-complexes  ]  [ extrait-de-paprika -> fr:extrait-de-paprika  ]  [ extrait-de -> fr:extrait-de  ]  [ extrait -> fr:extrait  ]
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       [ extracto-de-soja -> es:extracto-de-soja  ]  [ 78 -> es:78  ]  [ agua -> es:agua  ]  [ semillas-de-soja-8 -> es:semillas-de-soja-8  ]  [ 3 -> en:fd-c  ]  [ grasas-vegetales -> es:grasas-vegetales  ]  [ jarabe-de-glucosa -> es:jarabe-de-glucosa  ]  [ dextrosa -> es:dextrosa  ]  [ emulsionante -> es:emulsionante  ]  [ mono-y-digliceridos-de-acidos-grasos -> en:e471  -> exists  ]  [ e471 -> en:e471  ]  [ sal-marina -> es:sal-marina  ]  [ estabilizantes -> es:estabilizantes  ]  [ goma-xantana -> en:e415  -> exists  ]  [ e415 -> en:e415  ]  [ carragenatos -> en:e407  -> exists  ]  [ e407 -> en:e407  ]  [ goma-guar -> en:e412  -> exists  ]  [ e412 -> en:e412  ]  [ aromas -> es:aromas  ]  [ antioxidante -> es:antioxidante  ]  [ extractos-de-tocoferoles -> es:extractos-de-tocoferoles  ]  [ de-soja -> es:de-soja  ]  [ e306 -> en:e306  -> exists  ]  [ nota -> es:nota  ]  [ el-envase-en-italiano-del-paquete-que-puede-verse-en-el-enlace -> es:el-envase-en-italiano-del-paquete-que-puede-verse-en-el-enlace  ]  [ especifica-que-el-producto-es-100-vegetal-por-tanto-los-mono-y-digliceridos-de-acidos-grasos -> es:especifica-que-el-producto-es-100-vegetal-por-tanto-los-mono-y-digliceridos-de-acidos-grasos  ]  [ e471 -> en:e471  ]  [ son-de-origen-no-animal -> es:son-de-origen-no-animal  ]  [   -> es:   ]
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          [ pipas-de-girasol-y-sal -> es:pipas-de-girasol-y-sal  ]
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
##                            additives_tags
## 1                                 en:e440
## 2                                        
## 3                         en:e440,en:e330
## 4 en:e471,en:e415,en:e407,en:e412,en:e306
## 5                                        
## 6                                        
##                                                                                                                        additives_en
## 1                                                                                                                    E440 - Pectins
## 2                                                                                                                                  
## 3                                                                                                 E440 - Pectins,E330 - Citric acid
## 4 E471 - Mono- and diglycerides of fatty acids,E415 - Xanthan gum,E407 - Carrageenan,E412 - Guar gum,E306 - Tocopherol-rich extract
## 5                                                                                                                                  
## 6                                                                                                                                  
##   ingredients_from_palm_oil_n ingredients_from_palm_oil
## 1                           0                        NA
## 2                          NA                        NA
## 3                           0                        NA
## 4                           0                        NA
## 5                           0                        NA
## 6                          NA                        NA
##   ingredients_from_palm_oil_tags ingredients_that_may_be_from_palm_oil_n
## 1                                                                      0
## 2                                                                     NA
## 3                                                                      0
## 4                                                                      1
## 5                                                                      0
## 6                                                                     NA
##   ingredients_that_may_be_from_palm_oil
## 1                                    NA
## 2                                    NA
## 3                                    NA
## 4                                    NA
## 5                                    NA
## 6                                    NA
##             ingredients_that_may_be_from_palm_oil_tags nutrition_grade_uk
## 1                                                                      NA
## 2                                                                      NA
## 3                                                                      NA
## 4 e471-mono-et-diglycerides-d-acides-gras-alimentaires                 NA
## 5                                                                      NA
## 6                                                                      NA
##   nutrition_grade_fr         pnns_groups_1      pnns_groups_2
## 1                  d         Sugary snacks             Sweets
## 2                            Sugary snacks Chocolate products
## 3                    Fruits and vegetables             Fruits
## 4                  d               unknown            unknown
## 5                  d               unknown            unknown
## 6                                  unknown            unknown
##                                                                                                                                                                                                                                                                                                                               states
## 1                                                                                                                                   en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-to-be-completed, en:characteristics-completed, en:photos-validated, en:photos-uploaded
## 2                                                                                                                                  en:to-be-completed, en:nutrition-facts-to-be-completed, en:ingredients-to-be-completed, en:expiration-date-to-be-completed, en:characteristics-completed, en:photos-validated, en:photos-uploaded
## 3                                                                                                                                   en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-to-be-completed, en:characteristics-completed, en:photos-validated, en:photos-uploaded
## 4                                                                                                                                         en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-completed, en:characteristics-completed, en:photos-validated, en:photos-uploaded
## 5                                                                                                                                         en:to-be-checked, en:complete, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-completed, en:characteristics-completed, en:photos-validated, en:photos-uploaded
## 6 en:to-be-completed, en:nutrition-facts-to-be-completed, en:ingredients-to-be-completed, en:expiration-date-to-be-completed, en:characteristics-to-be-completed, en:categories-to-be-completed, en:brands-to-be-completed, en:packaging-to-be-completed, en:quantity-to-be-completed, en:photos-to-be-validated, en:photos-uploaded
##                                                                                                                                                                                                                                                                                                                states_tags
## 1                                                                                                                                en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-to-be-completed,en:characteristics-completed,en:photos-validated,en:photos-uploaded
## 2                                                                                                                              en:to-be-completed,en:nutrition-facts-to-be-completed,en:ingredients-to-be-completed,en:expiration-date-to-be-completed,en:characteristics-completed,en:photos-validated,en:photos-uploaded
## 3                                                                                                                                en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-to-be-completed,en:characteristics-completed,en:photos-validated,en:photos-uploaded
## 4                                                                                                                                      en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-completed,en:characteristics-completed,en:photos-validated,en:photos-uploaded
## 5                                                                                                                                      en:to-be-checked,en:complete,en:nutrition-facts-completed,en:ingredients-completed,en:expiration-date-completed,en:characteristics-completed,en:photos-validated,en:photos-uploaded
## 6 en:to-be-completed,en:nutrition-facts-to-be-completed,en:ingredients-to-be-completed,en:expiration-date-to-be-completed,en:characteristics-to-be-completed,en:categories-to-be-completed,en:brands-to-be-completed,en:packaging-to-be-completed,en:quantity-to-be-completed,en:photos-to-be-validated,en:photos-uploaded
##                                                                                                                                                                                                                                                                                 states_en
## 1                                                                                                                       To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date to be completed,Characteristics completed,Photos validated,Photos uploaded
## 2                                                                                                                  To be completed,Nutrition facts to be completed,Ingredients to be completed,Expiration date to be completed,Characteristics completed,Photos validated,Photos uploaded
## 3                                                                                                                       To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date to be completed,Characteristics completed,Photos validated,Photos uploaded
## 4                                                                                                                             To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date completed,Characteristics completed,Photos validated,Photos uploaded
## 5                                                                                                                             To be checked,Complete,Nutrition facts completed,Ingredients completed,Expiration date completed,Characteristics completed,Photos validated,Photos uploaded
## 6 To be completed,Nutrition facts to be completed,Ingredients to be completed,Expiration date to be completed,Characteristics to be completed,Categories to be completed,Brands to be completed,Packaging to be completed,Quantity to be completed,Photos to be validated,Photos uploaded
##                        main_category                main_category_en
## 1 en:plant-based-foods-and-beverages Plant-based foods and beverages
## 2                   en:sugary-snacks                   Sugary snacks
## 3 en:plant-based-foods-and-beverages Plant-based foods and beverages
## 4 en:plant-based-foods-and-beverages Plant-based foods and beverages
## 5 en:plant-based-foods-and-beverages Plant-based foods and beverages
## 6                                                                   
##                                                                      image_url
## 1 http://en.openfoodfacts.org/images/products/322/247/574/5867/front.8.400.jpg
## 2 http://en.openfoodfacts.org/images/products/541/097/688/0110/front.7.400.jpg
## 3 http://en.openfoodfacts.org/images/products/326/475/042/3503/front.6.400.jpg
## 4 http://en.openfoodfacts.org/images/products/800/604/024/7001/front.7.400.jpg
## 5 http://en.openfoodfacts.org/images/products/848/000/034/0764/front.6.400.jpg
## 6 http://en.openfoodfacts.org/images/products/008/770/317/7727/front.8.400.jpg
##                                                                image_small_url
## 1 http://en.openfoodfacts.org/images/products/322/247/574/5867/front.8.200.jpg
## 2 http://en.openfoodfacts.org/images/products/541/097/688/0110/front.7.200.jpg
## 3 http://en.openfoodfacts.org/images/products/326/475/042/3503/front.6.200.jpg
## 4 http://en.openfoodfacts.org/images/products/800/604/024/7001/front.7.200.jpg
## 5 http://en.openfoodfacts.org/images/products/848/000/034/0764/front.6.200.jpg
## 6 http://en.openfoodfacts.org/images/products/008/770/317/7727/front.8.200.jpg
##   energy_100g energy_from_fat_100g fat_100g saturated_fat_100g
## 1         918                   NA      0.0                0.0
## 2          NA                   NA       NA                 NA
## 3          NA                   NA       NA                 NA
## 4         766                   NA     16.7                9.9
## 5        2359                   NA     45.5                5.2
## 6          NA                   NA       NA                 NA
##   butyric_acid_100g caproic_acid_100g caprylic_acid_100g capric_acid_100g
## 1                NA                NA                 NA               NA
## 2                NA                NA                 NA               NA
## 3                NA                NA                 NA               NA
## 4                NA                NA                 NA               NA
## 5                NA                NA                 NA               NA
## 6                NA                NA                 NA               NA
##   lauric_acid_100g myristic_acid_100g palmitic_acid_100g stearic_acid_100g
## 1               NA                 NA                 NA                NA
## 2               NA                 NA                 NA                NA
## 3               NA                 NA                 NA                NA
## 4               NA                 NA                 NA                NA
## 5               NA                 NA                 NA                NA
## 6               NA                 NA                 NA                NA
##   arachidic_acid_100g behenic_acid_100g lignoceric_acid_100g
## 1                  NA                NA                   NA
## 2                  NA                NA                   NA
## 3                  NA                NA                   NA
## 4                  NA                NA                   NA
## 5                  NA                NA                   NA
## 6                  NA                NA                   NA
##   cerotic_acid_100g montanic_acid_100g melissic_acid_100g
## 1                NA                 NA                 NA
## 2                NA                 NA                 NA
## 3                NA                 NA                 NA
## 4                NA                 NA                 NA
## 5                NA                 NA                 NA
## 6                NA                 NA                 NA
##   monounsaturated_fat_100g polyunsaturated_fat_100g omega_3_fat_100g
## 1                       NA                       NA               NA
## 2                       NA                       NA               NA
## 3                       NA                       NA               NA
## 4                      2.9                      3.9               NA
## 5                      9.5                     32.8               NA
## 6                       NA                       NA               NA
##   alpha_linolenic_acid_100g eicosapentaenoic_acid_100g
## 1                        NA                         NA
## 2                        NA                         NA
## 3                        NA                         NA
## 4                        NA                         NA
## 5                        NA                         NA
## 6                        NA                         NA
##   docosahexaenoic_acid_100g omega_6_fat_100g linoleic_acid_100g
## 1                        NA               NA                 NA
## 2                        NA               NA                 NA
## 3                        NA               NA                 NA
## 4                        NA               NA                 NA
## 5                        NA               NA                 NA
## 6                        NA               NA                 NA
##   arachidonic_acid_100g gamma_linolenic_acid_100g
## 1                    NA                        NA
## 2                    NA                        NA
## 3                    NA                        NA
## 4                    NA                        NA
## 5                    NA                        NA
## 6                    NA                        NA
##   dihomo_gamma_linolenic_acid_100g omega_9_fat_100g oleic_acid_100g
## 1                               NA               NA              NA
## 2                               NA               NA              NA
## 3                               NA               NA              NA
## 4                               NA               NA              NA
## 5                               NA               NA              NA
## 6                               NA               NA              NA
##   elaidic_acid_100g gondoic_acid_100g mead_acid_100g erucic_acid_100g
## 1                NA                NA             NA               NA
## 2                NA                NA             NA               NA
## 3                NA                NA             NA               NA
## 4                NA                NA             NA               NA
## 5                NA                NA             NA               NA
## 6                NA                NA             NA               NA
##   nervonic_acid_100g trans_fat_100g cholesterol_100g carbohydrates_100g
## 1                 NA             NA               NA               54.0
## 2                 NA             NA               NA                 NA
## 3                 NA             NA               NA                 NA
## 4                 NA             NA            2e-04                5.7
## 5                 NA             NA               NA               17.3
## 6                 NA             NA               NA                 NA
##   sugars_100g sucrose_100g glucose_100g fructose_100g lactose_100g
## 1        54.0           NA           NA            NA           NA
## 2          NA           NA           NA            NA           NA
## 3          NA           NA           NA            NA           NA
## 4         4.2           NA           NA            NA           NA
## 5         2.7           NA           NA            NA           NA
## 6          NA           NA           NA            NA           NA
##   maltose_100g maltodextrins_100g starch_100g polyols_100g fiber_100g
## 1           NA                 NA          NA           NA         NA
## 2           NA                 NA          NA           NA         NA
## 3           NA                 NA          NA           NA         NA
## 4           NA                 NA          NA           NA        0.2
## 5           NA                 NA          NA           NA        9.0
## 6           NA                 NA          NA           NA         NA
##   proteins_100g casein_100g serum_proteins_100g nucleotides_100g salt_100g
## 1           0.0          NA                  NA               NA    0.0000
## 2            NA          NA                  NA               NA        NA
## 3            NA          NA                  NA               NA        NA
## 4           2.9          NA                  NA               NA    0.0508
## 5          18.2          NA                  NA               NA    3.9878
## 6            NA          NA                  NA               NA        NA
##   sodium_100g alcohol_100g vitamin_a_100g beta_carotene_100g
## 1        0.00           NA             NA                 NA
## 2          NA           NA             NA                 NA
## 3          NA           NA             NA                 NA
## 4        0.02           NA             NA                 NA
## 5        1.57           NA             NA                 NA
## 6          NA           NA             NA                 NA
##   vitamin_d_100g vitamin_e_100g vitamin_k_100g vitamin_c_100g
## 1             NA             NA             NA             NA
## 2             NA             NA             NA             NA
## 3             NA             NA             NA             NA
## 4             NA             NA             NA             NA
## 5             NA             NA             NA             NA
## 6             NA             NA             NA             NA
##   vitamin_b1_100g vitamin_b2_100g vitamin_pp_100g vitamin_b6_100g
## 1              NA              NA              NA              NA
## 2              NA              NA              NA              NA
## 3              NA              NA              NA              NA
## 4              NA              NA              NA              NA
## 5              NA              NA              NA              NA
## 6              NA              NA              NA              NA
##   vitamin_b9_100g vitamin_b12_100g biotin_100g pantothenic_acid_100g
## 1              NA               NA          NA                    NA
## 2              NA               NA          NA                    NA
## 3              NA               NA          NA                    NA
## 4              NA               NA          NA                    NA
## 5              NA               NA          NA                    NA
## 6              NA               NA          NA                    NA
##   silica_100g bicarbonate_100g potassium_100g chloride_100g calcium_100g
## 1          NA               NA             NA            NA           NA
## 2          NA               NA             NA            NA           NA
## 3          NA               NA             NA            NA           NA
## 4          NA               NA             NA            NA           NA
## 5          NA               NA             NA            NA           NA
## 6          NA               NA             NA            NA           NA
##   phosphorus_100g iron_100g magnesium_100g zinc_100g copper_100g
## 1              NA        NA             NA        NA          NA
## 2              NA        NA             NA        NA          NA
## 3              NA        NA             NA        NA          NA
## 4              NA        NA             NA        NA          NA
## 5           1.155    0.0038          0.129        NA          NA
## 6              NA        NA             NA        NA          NA
##   manganese_100g fluoride_100g selenium_100g chromium_100g molybdenum_100g
## 1             NA            NA            NA            NA              NA
## 2             NA            NA            NA            NA              NA
## 3             NA            NA            NA            NA              NA
## 4             NA            NA            NA            NA              NA
## 5             NA            NA            NA            NA              NA
## 6             NA            NA            NA            NA              NA
##   iodine_100g caffeine_100g taurine_100g ph_100g
## 1          NA            NA           NA      NA
## 2          NA            NA           NA      NA
## 3          NA            NA           NA      NA
## 4          NA            NA           NA      NA
## 5          NA            NA           NA      NA
## 6          NA            NA           NA      NA
##   fruits_vegetables_nuts_100g collagen_meat_protein_ratio_100g cocoa_100g
## 1                          54                               NA         NA
## 2                          NA                               NA         NA
## 3                          NA                               NA         NA
## 4                          NA                               NA         NA
## 5                          NA                               NA         NA
## 6                          NA                               NA         NA
##   chlorophyl_100g carbon_footprint_100g nutrition_score_fr_100g
## 1              NA                    NA                      11
## 2              NA                    NA                      NA
## 3              NA                    NA                      NA
## 4              NA                    NA                      11
## 5              NA                    NA                      17
## 6              NA                    NA                      NA
##   nutrition_score_uk_100g
## 1                      11
## 2                      NA
## 3                      NA
## 4                      11
## 5                      17
## 6                      NA
summary(food)
##        V1              code            url              creator         
##  Min.   :   1.0   Min.   :100030   Length:1500        Length:1500       
##  1st Qu.: 375.8   1st Qu.:124975   Class :character   Class :character  
##  Median : 750.5   Median :149514   Mode  :character   Mode  :character  
##  Mean   : 750.5   Mean   :149613                                        
##  3rd Qu.:1125.2   3rd Qu.:174506                                        
##  Max.   :1500.0   Max.   :199880                                        
##                                                                         
##    created_t         created_datetime   last_modified_t    
##  Min.   :1.332e+09   Length:1500        Min.   :1.340e+09  
##  1st Qu.:1.394e+09   Class :character   1st Qu.:1.424e+09  
##  Median :1.425e+09   Mode  :character   Median :1.437e+09  
##  Mean   :1.414e+09                      Mean   :1.430e+09  
##  3rd Qu.:1.436e+09                      3rd Qu.:1.446e+09  
##  Max.   :1.453e+09                      Max.   :1.453e+09  
##                                                            
##  last_modified_datetime product_name       generic_name      
##  Length:1500            Length:1500        Length:1500       
##  Class :character       Class :character   Class :character  
##  Mode  :character       Mode  :character   Mode  :character  
##                                                              
##                                                              
##                                                              
##                                                              
##    quantity          packaging         packaging_tags    
##  Length:1500        Length:1500        Length:1500       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     brands          brands_tags         categories       
##  Length:1500        Length:1500        Length:1500       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  categories_tags    categories_en        origins         
##  Length:1500        Length:1500        Length:1500       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  origins_tags       manufacturing_places manufacturing_places_tags
##  Length:1500        Length:1500          Length:1500              
##  Class :character   Class :character     Class :character         
##  Mode  :character   Mode  :character     Mode  :character         
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##     labels          labels_tags         labels_en        
##  Length:1500        Length:1500        Length:1500       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   emb_codes         emb_codes_tags     first_packaging_code_geo
##  Length:1500        Length:1500        Length:1500             
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##                                                                
##   cities        cities_tags        purchase_places       stores         
##  Mode:logical   Length:1500        Length:1500        Length:1500       
##  NA's:1500      Class :character   Class :character   Class :character  
##                 Mode  :character   Mode  :character   Mode  :character  
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##   countries         countries_tags     countries_en      
##  Length:1500        Length:1500        Length:1500       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  ingredients_text    allergens         allergens_en      traces         
##  Length:1500        Length:1500        Mode:logical   Length:1500       
##  Class :character   Class :character   NA's:1500      Class :character  
##  Mode  :character   Mode  :character                  Mode  :character  
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  traces_tags         traces_en         serving_size       no_nutriments 
##  Length:1500        Length:1500        Length:1500        Mode:logical  
##  Class :character   Class :character   Class :character   NA's:1500     
##  Mode  :character   Mode  :character   Mode  :character                 
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##   additives_n      additives         additives_tags     additives_en      
##  Min.   : 0.000   Length:1500        Length:1500        Length:1500       
##  1st Qu.: 0.000   Class :character   Class :character   Class :character  
##  Median : 1.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 1.846                                                           
##  3rd Qu.: 3.000                                                           
##  Max.   :17.000                                                           
##  NA's   :514                                                              
##  ingredients_from_palm_oil_n ingredients_from_palm_oil
##  Min.   :0.0000              Mode:logical             
##  1st Qu.:0.0000              NA's:1500                
##  Median :0.0000                                       
##  Mean   :0.0487                                       
##  3rd Qu.:0.0000                                       
##  Max.   :1.0000                                       
##  NA's   :514                                          
##  ingredients_from_palm_oil_tags ingredients_that_may_be_from_palm_oil_n
##  Length:1500                    Min.   :0.0000                         
##  Class :character               1st Qu.:0.0000                         
##  Mode  :character               Median :0.0000                         
##                                 Mean   :0.1379                         
##                                 3rd Qu.:0.0000                         
##                                 Max.   :4.0000                         
##                                 NA's   :514                            
##  ingredients_that_may_be_from_palm_oil
##  Mode:logical                         
##  NA's:1500                            
##                                       
##                                       
##                                       
##                                       
##                                       
##  ingredients_that_may_be_from_palm_oil_tags nutrition_grade_uk
##  Length:1500                                Mode:logical      
##  Class :character                           NA's:1500         
##  Mode  :character                                             
##                                                               
##                                                               
##                                                               
##                                                               
##  nutrition_grade_fr pnns_groups_1      pnns_groups_2     
##  Length:1500        Length:1500        Length:1500       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     states          states_tags         states_en        
##  Length:1500        Length:1500        Length:1500       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  main_category      main_category_en    image_url        
##  Length:1500        Length:1500        Length:1500       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  image_small_url     energy_100g     energy_from_fat_100g    fat_100g     
##  Length:1500        Min.   :   0.0   Min.   :   0.00      Min.   :  0.00  
##  Class :character   1st Qu.: 369.8   1st Qu.:  35.98      1st Qu.:  0.90  
##  Mode  :character   Median : 966.5   Median : 237.00      Median :  6.00  
##                     Mean   :1083.2   Mean   : 668.41      Mean   : 13.39  
##                     3rd Qu.:1641.5   3rd Qu.: 974.00      3rd Qu.: 20.00  
##                     Max.   :3700.0   Max.   :2900.00      Max.   :100.00  
##                     NA's   :700      NA's   :1486         NA's   :708     
##  saturated_fat_100g butyric_acid_100g caproic_acid_100g caprylic_acid_100g
##  Min.   : 0.000     Mode:logical      Mode:logical      Mode:logical      
##  1st Qu.: 0.200     NA's:1500         NA's:1500         NA's:1500         
##  Median : 1.700                                                           
##  Mean   : 4.874                                                           
##  3rd Qu.: 6.500                                                           
##  Max.   :57.000                                                           
##  NA's   :797                                                              
##  capric_acid_100g lauric_acid_100g myristic_acid_100g palmitic_acid_100g
##  Mode:logical     Mode:logical     Mode:logical       Mode:logical      
##  NA's:1500        NA's:1500        NA's:1500          NA's:1500         
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  stearic_acid_100g arachidic_acid_100g behenic_acid_100g
##  Mode:logical      Mode:logical        Mode:logical     
##  NA's:1500         NA's:1500           NA's:1500        
##                                                         
##                                                         
##                                                         
##                                                         
##                                                         
##  lignoceric_acid_100g cerotic_acid_100g montanic_acid_100g
##  Mode:logical         Mode:logical      Mode:logical      
##  NA's:1500            NA's:1500         NA's:1500         
##                                                           
##                                                           
##                                                           
##                                                           
##                                                           
##  melissic_acid_100g monounsaturated_fat_100g polyunsaturated_fat_100g
##  Mode:logical       Min.   : 0.00            Min.   : 0.400          
##  NA's:1500          1st Qu.: 3.87            1st Qu.: 1.653          
##                     Median : 9.50            Median : 3.900          
##                     Mean   :19.77            Mean   : 9.986          
##                     3rd Qu.:29.00            3rd Qu.:12.700          
##                     Max.   :75.00            Max.   :46.200          
##                     NA's   :1465             NA's   :1464            
##  omega_3_fat_100g alpha_linolenic_acid_100g eicosapentaenoic_acid_100g
##  Min.   : 0.033   Min.   :0.0800            Min.   :0.721             
##  1st Qu.: 1.300   1st Qu.:0.0905            1st Qu.:0.721             
##  Median : 3.000   Median :0.1010            Median :0.721             
##  Mean   : 3.726   Mean   :0.1737            Mean   :0.721             
##  3rd Qu.: 3.200   3rd Qu.:0.2205            3rd Qu.:0.721             
##  Max.   :12.400   Max.   :0.3400            Max.   :0.721             
##  NA's   :1491     NA's   :1497              NA's   :1499              
##  docosahexaenoic_acid_100g omega_6_fat_100g linoleic_acid_100g
##  Min.   :1.09              Min.   :0.25     Min.   :0.5000    
##  1st Qu.:1.09              1st Qu.:0.25     1st Qu.:0.5165    
##  Median :1.09              Median :0.25     Median :0.5330    
##  Mean   :1.09              Mean   :0.25     Mean   :0.5330    
##  3rd Qu.:1.09              3rd Qu.:0.25     3rd Qu.:0.5495    
##  Max.   :1.09              Max.   :0.25     Max.   :0.5660    
##  NA's   :1499              NA's   :1499     NA's   :1498      
##  arachidonic_acid_100g gamma_linolenic_acid_100g
##  Mode:logical          Mode:logical             
##  NA's:1500             NA's:1500                
##                                                 
##                                                 
##                                                 
##                                                 
##                                                 
##  dihomo_gamma_linolenic_acid_100g omega_9_fat_100g oleic_acid_100g
##  Mode:logical                     Mode:logical     Mode:logical   
##  NA's:1500                        NA's:1500        NA's:1500      
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##  elaidic_acid_100g gondoic_acid_100g mead_acid_100g erucic_acid_100g
##  Mode:logical      Mode:logical      Mode:logical   Mode:logical    
##  NA's:1500         NA's:1500         NA's:1500      NA's:1500       
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  nervonic_acid_100g trans_fat_100g   cholesterol_100g carbohydrates_100g
##  Mode:logical       Min.   :0.0000   Min.   :0.0000   Min.   :  0.000   
##  NA's:1500          1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:  3.792   
##                     Median :0.0000   Median :0.0000   Median : 13.500   
##                     Mean   :0.0105   Mean   :0.0265   Mean   : 27.958   
##                     3rd Qu.:0.0000   3rd Qu.:0.0026   3rd Qu.: 55.000   
##                     Max.   :0.1000   Max.   :0.4300   Max.   :100.000   
##                     NA's   :1481     NA's   :1477     NA's   :708       
##   sugars_100g     sucrose_100g   glucose_100g   fructose_100g 
##  Min.   :  0.00   Mode:logical   Mode:logical   Min.   :100   
##  1st Qu.:  1.00   NA's:1500      NA's:1500      1st Qu.:100   
##  Median :  4.05                                 Median :100   
##  Mean   : 12.66                                 Mean   :100   
##  3rd Qu.: 14.70                                 3rd Qu.:100   
##  Max.   :100.00                                 Max.   :100   
##  NA's   :788                                    NA's   :1499  
##   lactose_100g   maltose_100g   maltodextrins_100g  starch_100g   
##  Min.   :0.000   Mode:logical   Mode:logical       Min.   : 0.00  
##  1st Qu.:0.250   NA's:1500      NA's:1500          1st Qu.: 9.45  
##  Median :0.500                                     Median :39.50  
##  Mean   :2.933                                     Mean   :30.73  
##  3rd Qu.:4.400                                     3rd Qu.:42.85  
##  Max.   :8.300                                     Max.   :71.00  
##  NA's   :1497                                      NA's   :1493   
##   polyols_100g     fiber_100g     proteins_100g     casein_100g  
##  Min.   : 8.60   Min.   : 0.000   Min.   : 0.000   Min.   :1.1   
##  1st Qu.:59.10   1st Qu.: 0.500   1st Qu.: 1.500   1st Qu.:1.1   
##  Median :67.00   Median : 1.750   Median : 6.000   Median :1.1   
##  Mean   :56.06   Mean   : 2.823   Mean   : 7.563   Mean   :1.1   
##  3rd Qu.:69.80   3rd Qu.: 3.500   3rd Qu.:10.675   3rd Qu.:1.1   
##  Max.   :70.00   Max.   :46.700   Max.   :61.000   Max.   :1.1   
##  NA's   :1491    NA's   :994      NA's   :710      NA's   :1499  
##  serum_proteins_100g nucleotides_100g   salt_100g         sodium_100g     
##  Mode:logical        Mode:logical     Min.   :  0.0000   Min.   : 0.0000  
##  NA's:1500           NA's:1500        1st Qu.:  0.0438   1st Qu.: 0.0172  
##                                       Median :  0.4498   Median : 0.1771  
##                                       Mean   :  1.1205   Mean   : 0.4409  
##                                       3rd Qu.:  1.1938   3rd Qu.: 0.4700  
##                                       Max.   :102.0000   Max.   :40.0000  
##                                       NA's   :780        NA's   :780      
##   alcohol_100g   vitamin_a_100g   beta_carotene_100g vitamin_d_100g 
##  Min.   : 0.00   Min.   :0.0000   Mode:logical       Min.   :0e+00  
##  1st Qu.: 0.00   1st Qu.:0.0000   NA's:1500          1st Qu.:0e+00  
##  Median : 5.50   Median :0.0001                      Median :0e+00  
##  Mean   :10.07   Mean   :0.0003                      Mean   :0e+00  
##  3rd Qu.:13.00   3rd Qu.:0.0006                      3rd Qu.:0e+00  
##  Max.   :50.00   Max.   :0.0013                      Max.   :1e-04  
##  NA's   :1433    NA's   :1477                        NA's   :1485   
##  vitamin_e_100g   vitamin_k_100g vitamin_c_100g  vitamin_b1_100g 
##  Min.   :0.0005   Min.   :0      Min.   :0.000   Min.   :0.0001  
##  1st Qu.:0.0021   1st Qu.:0      1st Qu.:0.002   1st Qu.:0.0003  
##  Median :0.0044   Median :0      Median :0.019   Median :0.0004  
##  Mean   :0.0069   Mean   :0      Mean   :0.025   Mean   :0.0006  
##  3rd Qu.:0.0097   3rd Qu.:0      3rd Qu.:0.030   3rd Qu.:0.0010  
##  Max.   :0.0320   Max.   :0      Max.   :0.217   Max.   :0.0013  
##  NA's   :1478     NA's   :1498   NA's   :1459    NA's   :1478    
##  vitamin_b2_100g  vitamin_pp_100g  vitamin_b6_100g  vitamin_b9_100g
##  Min.   :0.0002   Min.   :0.0006   Min.   :0.0001   Min.   :0e+00  
##  1st Qu.:0.0003   1st Qu.:0.0033   1st Qu.:0.0002   1st Qu.:0e+00  
##  Median :0.0009   Median :0.0069   Median :0.0008   Median :1e-04  
##  Mean   :0.0011   Mean   :0.0086   Mean   :0.0112   Mean   :1e-04  
##  3rd Qu.:0.0013   3rd Qu.:0.0140   3rd Qu.:0.0012   3rd Qu.:2e-04  
##  Max.   :0.0066   Max.   :0.0160   Max.   :0.2000   Max.   :2e-04  
##  NA's   :1483     NA's   :1484     NA's   :1481     NA's   :1483   
##  vitamin_b12_100g  biotin_100g   pantothenic_acid_100g  silica_100g   
##  Min.   :0        Min.   :0      Min.   :0.0000        Min.   :8e-04  
##  1st Qu.:0        1st Qu.:0      1st Qu.:0.0007        1st Qu.:8e-04  
##  Median :0        Median :0      Median :0.0020        Median :8e-04  
##  Mean   :0        Mean   :0      Mean   :0.0027        Mean   :8e-04  
##  3rd Qu.:0        3rd Qu.:0      3rd Qu.:0.0051        3rd Qu.:8e-04  
##  Max.   :0        Max.   :0      Max.   :0.0060        Max.   :8e-04  
##  NA's   :1489     NA's   :1498   NA's   :1486          NA's   :1499   
##  bicarbonate_100g potassium_100g   chloride_100g     calcium_100g   
##  Min.   :0.0006   Min.   :0.0000   Min.   :0.0003   Min.   :0.0000  
##  1st Qu.:0.0678   1st Qu.:0.0650   1st Qu.:0.0006   1st Qu.:0.0450  
##  Median :0.1350   Median :0.1940   Median :0.0009   Median :0.1200  
##  Mean   :0.1692   Mean   :0.3288   Mean   :0.0144   Mean   :0.2040  
##  3rd Qu.:0.2535   3rd Qu.:0.3670   3rd Qu.:0.0214   3rd Qu.:0.1985  
##  Max.   :0.3720   Max.   :1.4300   Max.   :0.0420   Max.   :1.0000  
##  NA's   :1497     NA's   :1487     NA's   :1497     NA's   :1449    
##  phosphorus_100g    iron_100g      magnesium_100g     zinc_100g     
##  Min.   :0.0430   Min.   :0.0000   Min.   :0.0000   Min.   :0.0005  
##  1st Qu.:0.1938   1st Qu.:0.0012   1st Qu.:0.0670   1st Qu.:0.0009  
##  Median :0.3185   Median :0.0042   Median :0.1040   Median :0.0017  
##  Mean   :0.3777   Mean   :0.0045   Mean   :0.1066   Mean   :0.0016  
##  3rd Qu.:0.4340   3rd Qu.:0.0077   3rd Qu.:0.1300   3rd Qu.:0.0022  
##  Max.   :1.1550   Max.   :0.0137   Max.   :0.3330   Max.   :0.0026  
##  NA's   :1488     NA's   :1463     NA's   :1479     NA's   :1493    
##   copper_100g    manganese_100g fluoride_100g  selenium_100g 
##  Min.   :0e+00   Min.   :0      Min.   :0      Min.   :0     
##  1st Qu.:1e-04   1st Qu.:0      1st Qu.:0      1st Qu.:0     
##  Median :1e-04   Median :0      Median :0      Median :0     
##  Mean   :1e-04   Mean   :0      Mean   :0      Mean   :0     
##  3rd Qu.:1e-04   3rd Qu.:0      3rd Qu.:0      3rd Qu.:0     
##  Max.   :1e-04   Max.   :0      Max.   :0      Max.   :0     
##  NA's   :1498    NA's   :1499   NA's   :1498   NA's   :1499  
##  chromium_100g  molybdenum_100g  iodine_100g   caffeine_100g 
##  Mode:logical   Mode:logical    Min.   :0      Mode:logical  
##  NA's:1500      NA's:1500       1st Qu.:0      NA's:1500     
##                                 Median :0                    
##                                 Mean   :0                    
##                                 3rd Qu.:0                    
##                                 Max.   :0                    
##                                 NA's   :1499                 
##  taurine_100g   ph_100g        fruits_vegetables_nuts_100g
##  Mode:logical   Mode:logical   Min.   : 2.00              
##  NA's:1500      NA's:1500      1st Qu.:11.25              
##                                Median :42.00              
##                                Mean   :36.88              
##                                3rd Qu.:52.25              
##                                Max.   :80.00              
##                                NA's   :1470               
##  collagen_meat_protein_ratio_100g   cocoa_100g   chlorophyl_100g
##  Min.   :12.00                    Min.   :30     Mode:logical   
##  1st Qu.:13.50                    1st Qu.:47     NA's:1500      
##  Median :15.00                    Median :60                    
##  Mean   :15.67                    Mean   :57                    
##  3rd Qu.:17.50                    3rd Qu.:70                    
##  Max.   :20.00                    Max.   :81                    
##  NA's   :1497                     NA's   :1491                  
##  carbon_footprint_100g nutrition_score_fr_100g nutrition_score_uk_100g
##  Min.   : 12.00        Min.   :-12.000         Min.   :-12.000        
##  1st Qu.: 97.42        1st Qu.:  1.000         1st Qu.:  0.000        
##  Median :182.85        Median :  7.000         Median :  6.000        
##  Mean   :131.18        Mean   :  7.941         Mean   :  7.631        
##  3rd Qu.:190.78        3rd Qu.: 15.000         3rd Qu.: 16.000        
##  Max.   :198.70        Max.   : 28.000         Max.   : 28.000        
##  NA's   :1497          NA's   :825             NA's   :825
# Conclusion: it is hard to conclude much from summary() with so many columns (160).
# Use glimpse() or names() for a more concise look.
library(dplyr)
glimpse(food)
## Observations: 1,500
## Variables: 160
## $ V1                                         <int> 1, 2, 3, 4, 5, 6, 7...
## $ code                                       <int> 100030, 100050, 100...
## $ url                                        <chr> "http://world-en.op...
## $ creator                                    <chr> "sebleouf", "foodor...
## $ created_t                                  <int> 1424747544, 1450316...
## $ created_datetime                           <chr> "2015-02-24T03:12:2...
## $ last_modified_t                            <int> 1438445887, 1450817...
## $ last_modified_datetime                     <chr> "2015-08-01T16:18:0...
## $ product_name                               <chr> "Confiture de frais...
## $ generic_name                               <chr> "", "", "Pâtes de ...
## $ quantity                                   <chr> "265 g", "375g", "1...
## $ packaging                                  <chr> "Bocal,Verre", "Pla...
## $ packaging_tags                             <chr> "bocal,verre", "pla...
## $ brands                                     <chr> "Casino Délices", ...
## $ brands_tags                                <chr> "casino-delices", "...
## $ categories                                 <chr> "Aliments et boisso...
## $ categories_tags                            <chr> "en:plant-based-foo...
## $ categories_en                              <chr> "Plant-based foods ...
## $ origins                                    <chr> "", "", "", "", "Ar...
## $ origins_tags                               <chr> "", "", "", "", "ar...
## $ manufacturing_places                       <chr> "France", "Belgium"...
## $ manufacturing_places_tags                  <chr> "france", "belgium"...
## $ labels                                     <chr> "", "", "", "Vegeta...
## $ labels_tags                                <chr> "", "", "", "en:veg...
## $ labels_en                                  <chr> "", "", "", "Vegeta...
## $ emb_codes                                  <chr> "EMB 78015", "", ""...
## $ emb_codes_tags                             <chr> "emb-78015", "", ""...
## $ first_packaging_code_geo                   <chr> "48.983333,2.066667...
## $ cities                                     <lgl> NA, NA, NA, NA, NA,...
## $ cities_tags                                <chr> "andresy-yvelines-f...
## $ purchase_places                            <chr> "Lyon,France", "NSW...
## $ stores                                     <chr> "Casino", "", "", "...
## $ countries                                  <chr> "France", "Australi...
## $ countries_tags                             <chr> "en:france", "en:au...
## $ countries_en                               <chr> "France", "Australi...
## $ ingredients_text                           <chr> "Sucre de canne, fr...
## $ allergens                                  <chr> "", "", "", "", "",...
## $ allergens_en                               <lgl> NA, NA, NA, NA, NA,...
## $ traces                                     <chr> "Lait,Fruits à coq...
## $ traces_tags                                <chr> "en:milk,en:nuts", ...
## $ traces_en                                  <chr> "Milk,Nuts", "", ""...
## $ serving_size                               <chr> "15 g", "", "", "",...
## $ no_nutriments                              <lgl> NA, NA, NA, NA, NA,...
## $ additives_n                                <int> 1, NA, 2, 5, 0, NA,...
## $ additives                                  <chr> "[ sucre-de-canne -...
## $ additives_tags                             <chr> "en:e440", "", "en:...
## $ additives_en                               <chr> "E440 - Pectins", "...
## $ ingredients_from_palm_oil_n                <int> 0, NA, 0, 0, 0, NA,...
## $ ingredients_from_palm_oil                  <lgl> NA, NA, NA, NA, NA,...
## $ ingredients_from_palm_oil_tags             <chr> "", "", "", "", "",...
## $ ingredients_that_may_be_from_palm_oil_n    <int> 0, NA, 0, 1, 0, NA,...
## $ ingredients_that_may_be_from_palm_oil      <lgl> NA, NA, NA, NA, NA,...
## $ ingredients_that_may_be_from_palm_oil_tags <chr> "", "", "", "e471-m...
## $ nutrition_grade_uk                         <lgl> NA, NA, NA, NA, NA,...
## $ nutrition_grade_fr                         <chr> "d", "", "", "d", "...
## $ pnns_groups_1                              <chr> "Sugary snacks", "S...
## $ pnns_groups_2                              <chr> "Sweets", "Chocolat...
## $ states                                     <chr> "en:to-be-checked, ...
## $ states_tags                                <chr> "en:to-be-checked,e...
## $ states_en                                  <chr> "To be checked,Comp...
## $ main_category                              <chr> "en:plant-based-foo...
## $ main_category_en                           <chr> "Plant-based foods ...
## $ image_url                                  <chr> "http://en.openfood...
## $ image_small_url                            <chr> "http://en.openfood...
## $ energy_100g                                <dbl> 918, NA, NA, 766, 2...
## $ energy_from_fat_100g                       <dbl> NA, NA, NA, NA, NA,...
## $ fat_100g                                   <dbl> 0.00, NA, NA, 16.70...
## $ saturated_fat_100g                         <dbl> 0.000, NA, NA, 9.90...
## $ butyric_acid_100g                          <lgl> NA, NA, NA, NA, NA,...
## $ caproic_acid_100g                          <lgl> NA, NA, NA, NA, NA,...
## $ caprylic_acid_100g                         <lgl> NA, NA, NA, NA, NA,...
## $ capric_acid_100g                           <lgl> NA, NA, NA, NA, NA,...
## $ lauric_acid_100g                           <lgl> NA, NA, NA, NA, NA,...
## $ myristic_acid_100g                         <lgl> NA, NA, NA, NA, NA,...
## $ palmitic_acid_100g                         <lgl> NA, NA, NA, NA, NA,...
## $ stearic_acid_100g                          <lgl> NA, NA, NA, NA, NA,...
## $ arachidic_acid_100g                        <lgl> NA, NA, NA, NA, NA,...
## $ behenic_acid_100g                          <lgl> NA, NA, NA, NA, NA,...
## $ lignoceric_acid_100g                       <lgl> NA, NA, NA, NA, NA,...
## $ cerotic_acid_100g                          <lgl> NA, NA, NA, NA, NA,...
## $ montanic_acid_100g                         <lgl> NA, NA, NA, NA, NA,...
## $ melissic_acid_100g                         <lgl> NA, NA, NA, NA, NA,...
## $ monounsaturated_fat_100g                   <dbl> NA, NA, NA, 2.9, 9....
## $ polyunsaturated_fat_100g                   <dbl> NA, NA, NA, 3.9, 32...
## $ omega_3_fat_100g                           <dbl> NA, NA, NA, NA, NA,...
## $ alpha_linolenic_acid_100g                  <dbl> NA, NA, NA, NA, NA,...
## $ eicosapentaenoic_acid_100g                 <dbl> NA, NA, NA, NA, NA,...
## $ docosahexaenoic_acid_100g                  <dbl> NA, NA, NA, NA, NA,...
## $ omega_6_fat_100g                           <dbl> NA, NA, NA, NA, NA,...
## $ linoleic_acid_100g                         <dbl> NA, NA, NA, NA, NA,...
## $ arachidonic_acid_100g                      <lgl> NA, NA, NA, NA, NA,...
## $ gamma_linolenic_acid_100g                  <lgl> NA, NA, NA, NA, NA,...
## $ dihomo_gamma_linolenic_acid_100g           <lgl> NA, NA, NA, NA, NA,...
## $ omega_9_fat_100g                           <lgl> NA, NA, NA, NA, NA,...
## $ oleic_acid_100g                            <lgl> NA, NA, NA, NA, NA,...
## $ elaidic_acid_100g                          <lgl> NA, NA, NA, NA, NA,...
## $ gondoic_acid_100g                          <lgl> NA, NA, NA, NA, NA,...
## $ mead_acid_100g                             <lgl> NA, NA, NA, NA, NA,...
## $ erucic_acid_100g                           <lgl> NA, NA, NA, NA, NA,...
## $ nervonic_acid_100g                         <lgl> NA, NA, NA, NA, NA,...
## $ trans_fat_100g                             <dbl> NA, NA, NA, NA, NA,...
## $ cholesterol_100g                           <dbl> NA, NA, NA, 0.00020...
## $ carbohydrates_100g                         <dbl> 54.00, NA, NA, 5.70...
## $ sugars_100g                                <dbl> 54.00, NA, NA, 4.20...
## $ sucrose_100g                               <lgl> NA, NA, NA, NA, NA,...
## $ glucose_100g                               <lgl> NA, NA, NA, NA, NA,...
## $ fructose_100g                              <int> NA, NA, NA, NA, NA,...
## $ lactose_100g                               <dbl> NA, NA, NA, NA, NA,...
## $ maltose_100g                               <lgl> NA, NA, NA, NA, NA,...
## $ maltodextrins_100g                         <lgl> NA, NA, NA, NA, NA,...
## $ starch_100g                                <dbl> NA, NA, NA, NA, NA,...
## $ polyols_100g                               <dbl> NA, NA, NA, NA, NA,...
## $ fiber_100g                                 <dbl> NA, NA, NA, 0.2, 9....
## $ proteins_100g                              <dbl> 0.00, NA, NA, 2.90,...
## $ casein_100g                                <dbl> NA, NA, NA, NA, NA,...
## $ serum_proteins_100g                        <lgl> NA, NA, NA, NA, NA,...
## $ nucleotides_100g                           <lgl> NA, NA, NA, NA, NA,...
## $ salt_100g                                  <dbl> 0.0000000, NA, NA, ...
## $ sodium_100g                                <dbl> 0.0000000, NA, NA, ...
## $ alcohol_100g                               <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_a_100g                             <dbl> NA, NA, NA, NA, NA,...
## $ beta_carotene_100g                         <lgl> NA, NA, NA, NA, NA,...
## $ vitamin_d_100g                             <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_e_100g                             <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_k_100g                             <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_c_100g                             <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_b1_100g                            <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_b2_100g                            <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_pp_100g                            <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_b6_100g                            <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_b9_100g                            <dbl> NA, NA, NA, NA, NA,...
## $ vitamin_b12_100g                           <dbl> NA, NA, NA, NA, NA,...
## $ biotin_100g                                <dbl> NA, NA, NA, NA, NA,...
## $ pantothenic_acid_100g                      <dbl> NA, NA, NA, NA, NA,...
## $ silica_100g                                <dbl> NA, NA, NA, NA, NA,...
## $ bicarbonate_100g                           <dbl> NA, NA, NA, NA, NA,...
## $ potassium_100g                             <dbl> NA, NA, NA, NA, NA,...
## $ chloride_100g                              <dbl> NA, NA, NA, NA, NA,...
## $ calcium_100g                               <dbl> NA, NA, NA, NA, NA,...
## $ phosphorus_100g                            <dbl> NA, NA, NA, NA, 1.1...
## $ iron_100g                                  <dbl> NA, NA, NA, NA, 0.0...
## $ magnesium_100g                             <dbl> NA, NA, NA, NA, 0.1...
## $ zinc_100g                                  <dbl> NA, NA, NA, NA, NA,...
## $ copper_100g                                <dbl> NA, NA, NA, NA, NA,...
## $ manganese_100g                             <dbl> NA, NA, NA, NA, NA,...
## $ fluoride_100g                              <dbl> NA, NA, NA, NA, NA,...
## $ selenium_100g                              <dbl> NA, NA, NA, NA, NA,...
## $ chromium_100g                              <lgl> NA, NA, NA, NA, NA,...
## $ molybdenum_100g                            <lgl> NA, NA, NA, NA, NA,...
## $ iodine_100g                                <dbl> NA, NA, NA, NA, NA,...
## $ caffeine_100g                              <lgl> NA, NA, NA, NA, NA,...
## $ taurine_100g                               <lgl> NA, NA, NA, NA, NA,...
## $ ph_100g                                    <lgl> NA, NA, NA, NA, NA,...
## $ fruits_vegetables_nuts_100g                <dbl> 54, NA, NA, NA, NA,...
## $ collagen_meat_protein_ratio_100g           <int> NA, NA, NA, NA, NA,...
## $ cocoa_100g                                 <int> NA, NA, NA, NA, NA,...
## $ chlorophyl_100g                            <lgl> NA, NA, NA, NA, NA,...
## $ carbon_footprint_100g                      <dbl> NA, NA, NA, NA, NA,...
## $ nutrition_score_fr_100g                    <int> 11, NA, NA, 11, 17,...
## $ nutrition_score_uk_100g                    <int> 11, NA, NA, 11, 17,...
names(food)
##   [1] "V1"                                        
##   [2] "code"                                      
##   [3] "url"                                       
##   [4] "creator"                                   
##   [5] "created_t"                                 
##   [6] "created_datetime"                          
##   [7] "last_modified_t"                           
##   [8] "last_modified_datetime"                    
##   [9] "product_name"                              
##  [10] "generic_name"                              
##  [11] "quantity"                                  
##  [12] "packaging"                                 
##  [13] "packaging_tags"                            
##  [14] "brands"                                    
##  [15] "brands_tags"                               
##  [16] "categories"                                
##  [17] "categories_tags"                           
##  [18] "categories_en"                             
##  [19] "origins"                                   
##  [20] "origins_tags"                              
##  [21] "manufacturing_places"                      
##  [22] "manufacturing_places_tags"                 
##  [23] "labels"                                    
##  [24] "labels_tags"                               
##  [25] "labels_en"                                 
##  [26] "emb_codes"                                 
##  [27] "emb_codes_tags"                            
##  [28] "first_packaging_code_geo"                  
##  [29] "cities"                                    
##  [30] "cities_tags"                               
##  [31] "purchase_places"                           
##  [32] "stores"                                    
##  [33] "countries"                                 
##  [34] "countries_tags"                            
##  [35] "countries_en"                              
##  [36] "ingredients_text"                          
##  [37] "allergens"                                 
##  [38] "allergens_en"                              
##  [39] "traces"                                    
##  [40] "traces_tags"                               
##  [41] "traces_en"                                 
##  [42] "serving_size"                              
##  [43] "no_nutriments"                             
##  [44] "additives_n"                               
##  [45] "additives"                                 
##  [46] "additives_tags"                            
##  [47] "additives_en"                              
##  [48] "ingredients_from_palm_oil_n"               
##  [49] "ingredients_from_palm_oil"                 
##  [50] "ingredients_from_palm_oil_tags"            
##  [51] "ingredients_that_may_be_from_palm_oil_n"   
##  [52] "ingredients_that_may_be_from_palm_oil"     
##  [53] "ingredients_that_may_be_from_palm_oil_tags"
##  [54] "nutrition_grade_uk"                        
##  [55] "nutrition_grade_fr"                        
##  [56] "pnns_groups_1"                             
##  [57] "pnns_groups_2"                             
##  [58] "states"                                    
##  [59] "states_tags"                               
##  [60] "states_en"                                 
##  [61] "main_category"                             
##  [62] "main_category_en"                          
##  [63] "image_url"                                 
##  [64] "image_small_url"                           
##  [65] "energy_100g"                               
##  [66] "energy_from_fat_100g"                      
##  [67] "fat_100g"                                  
##  [68] "saturated_fat_100g"                        
##  [69] "butyric_acid_100g"                         
##  [70] "caproic_acid_100g"                         
##  [71] "caprylic_acid_100g"                        
##  [72] "capric_acid_100g"                          
##  [73] "lauric_acid_100g"                          
##  [74] "myristic_acid_100g"                        
##  [75] "palmitic_acid_100g"                        
##  [76] "stearic_acid_100g"                         
##  [77] "arachidic_acid_100g"                       
##  [78] "behenic_acid_100g"                         
##  [79] "lignoceric_acid_100g"                      
##  [80] "cerotic_acid_100g"                         
##  [81] "montanic_acid_100g"                        
##  [82] "melissic_acid_100g"                        
##  [83] "monounsaturated_fat_100g"                  
##  [84] "polyunsaturated_fat_100g"                  
##  [85] "omega_3_fat_100g"                          
##  [86] "alpha_linolenic_acid_100g"                 
##  [87] "eicosapentaenoic_acid_100g"                
##  [88] "docosahexaenoic_acid_100g"                 
##  [89] "omega_6_fat_100g"                          
##  [90] "linoleic_acid_100g"                        
##  [91] "arachidonic_acid_100g"                     
##  [92] "gamma_linolenic_acid_100g"                 
##  [93] "dihomo_gamma_linolenic_acid_100g"          
##  [94] "omega_9_fat_100g"                          
##  [95] "oleic_acid_100g"                           
##  [96] "elaidic_acid_100g"                         
##  [97] "gondoic_acid_100g"                         
##  [98] "mead_acid_100g"                            
##  [99] "erucic_acid_100g"                          
## [100] "nervonic_acid_100g"                        
## [101] "trans_fat_100g"                            
## [102] "cholesterol_100g"                          
## [103] "carbohydrates_100g"                        
## [104] "sugars_100g"                               
## [105] "sucrose_100g"                              
## [106] "glucose_100g"                              
## [107] "fructose_100g"                             
## [108] "lactose_100g"                              
## [109] "maltose_100g"                              
## [110] "maltodextrins_100g"                        
## [111] "starch_100g"                               
## [112] "polyols_100g"                              
## [113] "fiber_100g"                                
## [114] "proteins_100g"                             
## [115] "casein_100g"                               
## [116] "serum_proteins_100g"                       
## [117] "nucleotides_100g"                          
## [118] "salt_100g"                                 
## [119] "sodium_100g"                               
## [120] "alcohol_100g"                              
## [121] "vitamin_a_100g"                            
## [122] "beta_carotene_100g"                        
## [123] "vitamin_d_100g"                            
## [124] "vitamin_e_100g"                            
## [125] "vitamin_k_100g"                            
## [126] "vitamin_c_100g"                            
## [127] "vitamin_b1_100g"                           
## [128] "vitamin_b2_100g"                           
## [129] "vitamin_pp_100g"                           
## [130] "vitamin_b6_100g"                           
## [131] "vitamin_b9_100g"                           
## [132] "vitamin_b12_100g"                          
## [133] "biotin_100g"                               
## [134] "pantothenic_acid_100g"                     
## [135] "silica_100g"                               
## [136] "bicarbonate_100g"                          
## [137] "potassium_100g"                            
## [138] "chloride_100g"                             
## [139] "calcium_100g"                              
## [140] "phosphorus_100g"                           
## [141] "iron_100g"                                 
## [142] "magnesium_100g"                            
## [143] "zinc_100g"                                 
## [144] "copper_100g"                               
## [145] "manganese_100g"                            
## [146] "fluoride_100g"                             
## [147] "selenium_100g"                             
## [148] "chromium_100g"                             
## [149] "molybdenum_100g"                           
## [150] "iodine_100g"                               
## [151] "caffeine_100g"                             
## [152] "taurine_100g"                              
## [153] "ph_100g"                                   
## [154] "fruits_vegetables_nuts_100g"               
## [155] "collagen_meat_protein_ratio_100g"          
## [156] "cocoa_100g"                                
## [157] "chlorophyl_100g"                           
## [158] "carbon_footprint_100g"                     
## [159] "nutrition_score_fr_100g"                   
## [160] "nutrition_score_uk_100g"
# Conclusions (numbers are column indices):
# What information was added and when (1:9)
# Meta information about the food (10:17, 22:27)
# Where it came from (18:21, 28:34)
# What it's made of (35:52)
# Nutrition grades (53:54)
# Unclear (55:63)
# Nutritional information (64:159).
# Some columns are duplicates of others; drop them by index:
duplicates <- c(4, 6, 11, 13, 15, 17, 18, 20, 22, 
                24, 25, 28, 32, 34, 36, 38, 40, 
                44, 46, 48, 51, 54, 65, 158)
food2 <- food[,-duplicates]
# Some columns are not useful for this analysis; drop them as well:
useless <- c(1, 2, 3, 32:41)
food3 <- food2[,-useless]
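Dropping columns by hard-coded position is fragile if the column order ever changes. The same drop can be done by name; the sketch below reproduces food2 using base R (duplicate_names and food2_by_name are illustrative names and are not used later).

# Resolve the duplicate indices to column names, then drop by name.
duplicate_names <- names(food)[duplicates]
food2_by_name <- food[, setdiff(names(food), duplicate_names)]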

# We care only about the nutrition columns, the ones with "100g" in their name.
library(stringr)
nutrition <- str_detect(colnames(food3), "100g")
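# An equivalent selection with dplyr's contains() helper (a sketch; it keeps
# the same set of "100g" columns, and nutrition_cols is an illustrative name):
nutrition_cols <- select(food3, contains("100g"))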
summary(food3[,nutrition])
##  energy_from_fat_100g    fat_100g      saturated_fat_100g
##  Min.   :   0.00      Min.   :  0.00   Min.   : 0.000    
##  1st Qu.:  35.98      1st Qu.:  0.90   1st Qu.: 0.200    
##  Median : 237.00      Median :  6.00   Median : 1.700    
##  Mean   : 668.41      Mean   : 13.39   Mean   : 4.874    
##  3rd Qu.: 974.00      3rd Qu.: 20.00   3rd Qu.: 6.500    
##  Max.   :2900.00      Max.   :100.00   Max.   :57.000    
##  NA's   :1486         NA's   :708      NA's   :797       
##  butyric_acid_100g caproic_acid_100g caprylic_acid_100g capric_acid_100g
##  Mode:logical      Mode:logical      Mode:logical       Mode:logical    
##  NA's:1500         NA's:1500         NA's:1500          NA's:1500       
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  lauric_acid_100g myristic_acid_100g palmitic_acid_100g stearic_acid_100g
##  Mode:logical     Mode:logical       Mode:logical       Mode:logical     
##  NA's:1500        NA's:1500          NA's:1500          NA's:1500        
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  arachidic_acid_100g behenic_acid_100g lignoceric_acid_100g
##  Mode:logical        Mode:logical      Mode:logical        
##  NA's:1500           NA's:1500         NA's:1500           
##                                                            
##                                                            
##                                                            
##                                                            
##                                                            
##  cerotic_acid_100g montanic_acid_100g melissic_acid_100g
##  Mode:logical      Mode:logical       Mode:logical      
##  NA's:1500         NA's:1500          NA's:1500         
##                                                         
##                                                         
##                                                         
##                                                         
##                                                         
##  monounsaturated_fat_100g polyunsaturated_fat_100g omega_3_fat_100g
##  Min.   : 0.00            Min.   : 0.400           Min.   : 0.033  
##  1st Qu.: 3.87            1st Qu.: 1.653           1st Qu.: 1.300  
##  Median : 9.50            Median : 3.900           Median : 3.000  
##  Mean   :19.77            Mean   : 9.986           Mean   : 3.726  
##  3rd Qu.:29.00            3rd Qu.:12.700           3rd Qu.: 3.200  
##  Max.   :75.00            Max.   :46.200           Max.   :12.400  
##  NA's   :1465             NA's   :1464             NA's   :1491    
##  alpha_linolenic_acid_100g eicosapentaenoic_acid_100g
##  Min.   :0.0800            Min.   :0.721             
##  1st Qu.:0.0905            1st Qu.:0.721             
##  Median :0.1010            Median :0.721             
##  Mean   :0.1737            Mean   :0.721             
##  3rd Qu.:0.2205            3rd Qu.:0.721             
##  Max.   :0.3400            Max.   :0.721             
##  NA's   :1497              NA's   :1499              
##  docosahexaenoic_acid_100g omega_6_fat_100g linoleic_acid_100g
##  Min.   :1.09              Min.   :0.25     Min.   :0.5000    
##  1st Qu.:1.09              1st Qu.:0.25     1st Qu.:0.5165    
##  Median :1.09              Median :0.25     Median :0.5330    
##  Mean   :1.09              Mean   :0.25     Mean   :0.5330    
##  3rd Qu.:1.09              3rd Qu.:0.25     3rd Qu.:0.5495    
##  Max.   :1.09              Max.   :0.25     Max.   :0.5660    
##  NA's   :1499              NA's   :1499     NA's   :1498      
##  arachidonic_acid_100g gamma_linolenic_acid_100g
##  Mode:logical          Mode:logical             
##  NA's:1500             NA's:1500                
##                                                 
##                                                 
##                                                 
##                                                 
##                                                 
##  dihomo_gamma_linolenic_acid_100g omega_9_fat_100g oleic_acid_100g
##  Mode:logical                     Mode:logical     Mode:logical   
##  NA's:1500                        NA's:1500        NA's:1500      
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##  elaidic_acid_100g gondoic_acid_100g mead_acid_100g erucic_acid_100g
##  Mode:logical      Mode:logical      Mode:logical   Mode:logical    
##  NA's:1500         NA's:1500         NA's:1500      NA's:1500       
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  nervonic_acid_100g trans_fat_100g   cholesterol_100g carbohydrates_100g
##  Mode:logical       Min.   :0.0000   Min.   :0.0000   Min.   :  0.000   
##  NA's:1500          1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:  3.792   
##                     Median :0.0000   Median :0.0000   Median : 13.500   
##                     Mean   :0.0105   Mean   :0.0265   Mean   : 27.958   
##                     3rd Qu.:0.0000   3rd Qu.:0.0026   3rd Qu.: 55.000   
##                     Max.   :0.1000   Max.   :0.4300   Max.   :100.000   
##                     NA's   :1481     NA's   :1477     NA's   :708       
##   sugars_100g     sucrose_100g   glucose_100g   fructose_100g 
##  Min.   :  0.00   Mode:logical   Mode:logical   Min.   :100   
##  1st Qu.:  1.00   NA's:1500      NA's:1500      1st Qu.:100   
##  Median :  4.05                                 Median :100   
##  Mean   : 12.66                                 Mean   :100   
##  3rd Qu.: 14.70                                 3rd Qu.:100   
##  Max.   :100.00                                 Max.   :100   
##  NA's   :788                                    NA's   :1499  
##   lactose_100g   maltose_100g   maltodextrins_100g  starch_100g   
##  Min.   :0.000   Mode:logical   Mode:logical       Min.   : 0.00  
##  1st Qu.:0.250   NA's:1500      NA's:1500          1st Qu.: 9.45  
##  Median :0.500                                     Median :39.50  
##  Mean   :2.933                                     Mean   :30.73  
##  3rd Qu.:4.400                                     3rd Qu.:42.85  
##  Max.   :8.300                                     Max.   :71.00  
##  NA's   :1497                                      NA's   :1493   
##   polyols_100g     fiber_100g     proteins_100g     casein_100g  
##  Min.   : 8.60   Min.   : 0.000   Min.   : 0.000   Min.   :1.1   
##  1st Qu.:59.10   1st Qu.: 0.500   1st Qu.: 1.500   1st Qu.:1.1   
##  Median :67.00   Median : 1.750   Median : 6.000   Median :1.1   
##  Mean   :56.06   Mean   : 2.823   Mean   : 7.563   Mean   :1.1   
##  3rd Qu.:69.80   3rd Qu.: 3.500   3rd Qu.:10.675   3rd Qu.:1.1   
##  Max.   :70.00   Max.   :46.700   Max.   :61.000   Max.   :1.1   
##  NA's   :1491    NA's   :994      NA's   :710      NA's   :1499  
##  serum_proteins_100g nucleotides_100g   salt_100g         sodium_100g     
##  Mode:logical        Mode:logical     Min.   :  0.0000   Min.   : 0.0000  
##  NA's:1500           NA's:1500        1st Qu.:  0.0438   1st Qu.: 0.0172  
##                                       Median :  0.4498   Median : 0.1771  
##                                       Mean   :  1.1205   Mean   : 0.4409  
##                                       3rd Qu.:  1.1938   3rd Qu.: 0.4700  
##                                       Max.   :102.0000   Max.   :40.0000  
##                                       NA's   :780        NA's   :780      
##   alcohol_100g   vitamin_a_100g   beta_carotene_100g vitamin_d_100g 
##  Min.   : 0.00   Min.   :0.0000   Mode:logical       Min.   :0e+00  
##  1st Qu.: 0.00   1st Qu.:0.0000   NA's:1500          1st Qu.:0e+00  
##  Median : 5.50   Median :0.0001                      Median :0e+00  
##  Mean   :10.07   Mean   :0.0003                      Mean   :0e+00  
##  3rd Qu.:13.00   3rd Qu.:0.0006                      3rd Qu.:0e+00  
##  Max.   :50.00   Max.   :0.0013                      Max.   :1e-04  
##  NA's   :1433    NA's   :1477                        NA's   :1485   
##  vitamin_e_100g   vitamin_k_100g vitamin_c_100g  vitamin_b1_100g 
##  Min.   :0.0005   Min.   :0      Min.   :0.000   Min.   :0.0001  
##  1st Qu.:0.0021   1st Qu.:0      1st Qu.:0.002   1st Qu.:0.0003  
##  Median :0.0044   Median :0      Median :0.019   Median :0.0004  
##  Mean   :0.0069   Mean   :0      Mean   :0.025   Mean   :0.0006  
##  3rd Qu.:0.0097   3rd Qu.:0      3rd Qu.:0.030   3rd Qu.:0.0010  
##  Max.   :0.0320   Max.   :0      Max.   :0.217   Max.   :0.0013  
##  NA's   :1478     NA's   :1498   NA's   :1459    NA's   :1478    
##  vitamin_b2_100g  vitamin_pp_100g  vitamin_b6_100g  vitamin_b9_100g
##  Min.   :0.0002   Min.   :0.0006   Min.   :0.0001   Min.   :0e+00  
##  1st Qu.:0.0003   1st Qu.:0.0033   1st Qu.:0.0002   1st Qu.:0e+00  
##  Median :0.0009   Median :0.0069   Median :0.0008   Median :1e-04  
##  Mean   :0.0011   Mean   :0.0086   Mean   :0.0112   Mean   :1e-04  
##  3rd Qu.:0.0013   3rd Qu.:0.0140   3rd Qu.:0.0012   3rd Qu.:2e-04  
##  Max.   :0.0066   Max.   :0.0160   Max.   :0.2000   Max.   :2e-04  
##  NA's   :1483     NA's   :1484     NA's   :1481     NA's   :1483   
##  vitamin_b12_100g  biotin_100g   pantothenic_acid_100g  silica_100g   
##  Min.   :0        Min.   :0      Min.   :0.0000        Min.   :8e-04  
##  1st Qu.:0        1st Qu.:0      1st Qu.:0.0007        1st Qu.:8e-04  
##  Median :0        Median :0      Median :0.0020        Median :8e-04  
##  Mean   :0        Mean   :0      Mean   :0.0027        Mean   :8e-04  
##  3rd Qu.:0        3rd Qu.:0      3rd Qu.:0.0051        3rd Qu.:8e-04  
##  Max.   :0        Max.   :0      Max.   :0.0060        Max.   :8e-04  
##  NA's   :1489     NA's   :1498   NA's   :1486          NA's   :1499   
##  bicarbonate_100g potassium_100g   chloride_100g     calcium_100g   
##  Min.   :0.0006   Min.   :0.0000   Min.   :0.0003   Min.   :0.0000  
##  1st Qu.:0.0678   1st Qu.:0.0650   1st Qu.:0.0006   1st Qu.:0.0450  
##  Median :0.1350   Median :0.1940   Median :0.0009   Median :0.1200  
##  Mean   :0.1692   Mean   :0.3288   Mean   :0.0144   Mean   :0.2040  
##  3rd Qu.:0.2535   3rd Qu.:0.3670   3rd Qu.:0.0214   3rd Qu.:0.1985  
##  Max.   :0.3720   Max.   :1.4300   Max.   :0.0420   Max.   :1.0000  
##  NA's   :1497     NA's   :1487     NA's   :1497     NA's   :1449    
##  phosphorus_100g    iron_100g      magnesium_100g     zinc_100g     
##  Min.   :0.0430   Min.   :0.0000   Min.   :0.0000   Min.   :0.0005  
##  1st Qu.:0.1938   1st Qu.:0.0012   1st Qu.:0.0670   1st Qu.:0.0009  
##  Median :0.3185   Median :0.0042   Median :0.1040   Median :0.0017  
##  Mean   :0.3777   Mean   :0.0045   Mean   :0.1066   Mean   :0.0016  
##  3rd Qu.:0.4340   3rd Qu.:0.0077   3rd Qu.:0.1300   3rd Qu.:0.0022  
##  Max.   :1.1550   Max.   :0.0137   Max.   :0.3330   Max.   :0.0026  
##  NA's   :1488     NA's   :1463     NA's   :1479     NA's   :1493    
##   copper_100g    manganese_100g fluoride_100g  selenium_100g 
##  Min.   :0e+00   Min.   :0      Min.   :0      Min.   :0     
##  1st Qu.:1e-04   1st Qu.:0      1st Qu.:0      1st Qu.:0     
##  Median :1e-04   Median :0      Median :0      Median :0     
##  Mean   :1e-04   Mean   :0      Mean   :0      Mean   :0     
##  3rd Qu.:1e-04   3rd Qu.:0      3rd Qu.:0      3rd Qu.:0     
##  Max.   :1e-04   Max.   :0      Max.   :0      Max.   :0     
##  NA's   :1498    NA's   :1499   NA's   :1498   NA's   :1499  
##  chromium_100g  molybdenum_100g  iodine_100g   caffeine_100g 
##  Mode:logical   Mode:logical    Min.   :0      Mode:logical  
##  NA's:1500      NA's:1500       1st Qu.:0      NA's:1500     
##                                 Median :0                    
##                                 Mean   :0                    
##                                 3rd Qu.:0                    
##                                 Max.   :0                    
##                                 NA's   :1499                 
##  taurine_100g   ph_100g        fruits_vegetables_nuts_100g
##  Mode:logical   Mode:logical   Min.   : 2.00              
##  NA's:1500      NA's:1500      1st Qu.:11.25              
##                                Median :42.00              
##                                Mean   :36.88              
##                                3rd Qu.:52.25              
##                                Max.   :80.00              
##                                NA's   :1470               
##  collagen_meat_protein_ratio_100g   cocoa_100g   chlorophyl_100g
##  Min.   :12.00                    Min.   :30     Mode:logical   
##  1st Qu.:13.50                    1st Qu.:47     NA's:1500      
##  Median :15.00                    Median :60                    
##  Mean   :15.67                    Mean   :57                    
##  3rd Qu.:17.50                    3rd Qu.:70                    
##  Max.   :20.00                    Max.   :81                    
##  NA's   :1497                     NA's   :1491                  
##  nutrition_score_fr_100g nutrition_score_uk_100g
##  Min.   :-12.000         Min.   :-12.000        
##  1st Qu.:  1.000         1st Qu.:  0.000        
##  Median :  7.000         Median :  6.000        
##  Mean   :  7.941         Mean   :  7.631        
##  3rd Qu.: 15.000         3rd Qu.: 16.000        
##  Max.   : 28.000         Max.   : 28.000        
##  NA's   :825             NA's   :825
# Replace NAs in sugars_100g with 0.
missing <- is.na(food3$sugars_100g)
food3$sugars_100g[missing] <- 0
# Keep only the observations with some sugar.
food4 <- food3[food3$sugars_100g > 0, ]

# How many observations are packaged in plastic?
plastic <- str_detect(food3$packaging, "plasti")
sum(plastic)
## [1] 232

Example 4

library(readxl)
att_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1294/datasets/attendance.xls"
# read_excel() cannot read directly from the internet, so download the file to a local drive first.
# The following download.file() command runs, but the downloaded content is unreadable, so it is commented out.
#download.file(att_url, file.path("C:/Users/mpfol/OneDrive/Documents/Data Science/Data", "att.xls"))
att_path <- file.path("C:/Users/mpfol/OneDrive/Documents/Data Science/Data", "attendance.xls")
att <- read_excel(att_path, skip = 1)

Data

Types of variables:

* Numeric (quantitative)
  * continuous
  * discrete
* Categorical (qualitative)
  * ordinal
  * nominal (a.k.a. categorical)

Identify data subsets with table(). Subset the data with dplyr::filter(). The subsetted data will still contain empty bins for the unused factor levels; if that is a problem, drop them with droplevels().

library(openintro)
## Warning: package 'openintro' was built under R version 3.4.4
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked _by_ '.GlobalEnv':
## 
##     cars, iris
## The following object is masked from 'package:reshape2':
## 
##     tips
## The following object is masked from 'package:ggplot2':
## 
##     diamonds
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
library(dplyr)

# Load data from a library.  Note that with lazy-loading, it is almost never necessary to load data from packages this way.
data(hsb2)

# hsb2$schtyp has two levels, public and private.  Filtering for just public does not change the level definition of hsb2$schtyp.  Use droplevels to remove private. 
table(hsb2$schtyp)
## 
##  public private 
##     168      32
hsb2_public <- hsb2 %>%
  filter(schtyp == "public")
table(hsb2_public$schtyp)
## 
##  public private 
##     168       0
hsb2_public$schtyp <- droplevels(hsb2_public$schtyp)
table(hsb2_public$schtyp)
## 
## public 
##    168

Discretize a numeric variable into a categorical variable with ifelse() or case_when().

library(openintro)
library(dplyr)
data(hsb2)

# Wrapping an assignment in parentheses assigns the value and also prints it.
(med_read <- median(hsb2$read))
## [1] 50
hsb2 <- hsb2 %>%
  mutate(read_cat = ifelse(read < med_read, "< median", 
                           ifelse(read == med_read, "median", "> median")))
hsb2 %>%
  count(read_cat)
## # A tibble: 3 x 2
##   read_cat     n
##   <chr>    <int>
## 1 < median    83
## 2 > median    99
## 3 median      18
hsb2 <- hsb2 %>%
  mutate(
    race_white_oth = case_when(
      race == "white" ~ "white", 
      race != "white" ~ "other"  
    )
  )

Visualize numerical data with scatterplots.
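
For example, a minimal scatterplot sketch using the hsb2 data loaded above (this assumes ggplot2 is already attached, as it is earlier in these notes):

ggplot(hsb2, aes(x = math, y = read)) +
  geom_point()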

Observational Studies and Experiments

Random sampling allows for generalization and applies to both observational studies and experiments. Random assignment allows for causal conclusions and applies only to experiments.

library(openintro)
library(dplyr)
data(hsb2)

hsb2 %>%
  count(race, schtyp) %>%
  group_by(race) %>%
  mutate(prop = n / sum(n))
## # A tibble: 8 x 4
## # Groups:   race [4]
##   race             schtyp      n   prop
##   <chr>            <fct>   <int>  <dbl>
## 1 african american public     18 0.900 
## 2 african american private     2 0.100 
## 3 asian            public     10 0.909 
## 4 asian            private     1 0.0909
## 5 hispanic         public     22 0.917 
## 6 hispanic         private     2 0.0833
## 7 white            public    118 0.814 
## 8 white            private    27 0.186

Sampling Strategies

Censuses are expensive, and the underlying population changes anyway. Instead we sample. Four common methods are simple random sampling, stratified sampling (group into strata first to guarantee equal representation), clustered sampling (cluster population, randomly choose clusters, then take census), and multi-stage sampling (cluster population, randomly choose clusters, then randomly sample).

library(openintro)
library(dplyr)

data(county)
county_noDC <- county %>%
  filter(state != "District of Columbia") %>%
  droplevels()

# simple random sample
county_srs <- county_noDC %>%
  sample_n(size = 150)
county_srs %>%
  group_by(state) %>%
  count()
## # A tibble: 40 x 2
## # Groups:   state [40]
##    state          n
##    <fct>      <int>
##  1 Alabama        4
##  2 Alaska         1
##  3 Arkansas       5
##  4 California     1
##  5 Colorado       3
##  6 Florida        3
##  7 Georgia        3
##  8 Hawaii         1
##  9 Idaho          3
## 10 Illinois       5
## # ... with 30 more rows
# stratified sample
county_ss <- county_noDC %>%
  group_by(state) %>%
  sample_n(size = 3)
county_ss %>%
  group_by(state) %>%
  count()
## # A tibble: 50 x 2
## # Groups:   state [50]
##    state           n
##    <fct>       <int>
##  1 Alabama         3
##  2 Alaska          3
##  3 Arizona         3
##  4 Arkansas        3
##  5 California      3
##  6 Colorado        3
##  7 Connecticut     3
##  8 Delaware        3
##  9 Florida         3
## 10 Georgia         3
## # ... with 40 more rows
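
The chunk above shows simple random and stratified samples. A cluster sample can be sketched the same way: randomly choose whole states, then keep every county in them (a minimal sketch; the choice of 5 clusters is an arbitrary assumption).

# cluster sample: randomly pick 5 states (clusters), then take a census of their counties
cluster_states <- sample(levels(county_noDC$state), size = 5)
county_cluster <- county_noDC %>%
  filter(state %in% cluster_states)
county_cluster %>%
  group_by(state) %>%
  count()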

The principles of experimental design are control, randomize, replicate (a sufficiently large sample), and block (control for confounding variables). Explanatory variables are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with that you would like to control for. Control for a variable with stratifying in random sampling, and with blocking in random assignment.

Exploratory Data Analysis

Exploring Categorical Variables

This section covers graphical and numerical summaries of categorical variables. levels(x) provides access (display or write) to the levels of a factor variable. Another way to get the levels is with a table. Explore the relationship between two categorical variables with a contingency table or a stacked bar chart.

library(readr)
library(dplyr)
library(ggplot2)

comics <- read_csv("https://assets.datacamp.com/production/course_1796/datasets/comics.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   id = col_character(),
##   align = col_character(),
##   eye = col_character(),
##   hair = col_character(),
##   gender = col_character(),
##   gsm = col_character(),
##   alive = col_character(),
##   appearances = col_integer(),
##   first_appear = col_character(),
##   publisher = col_character()
## )
# Drop underrepresented data from analysis
comics <- comics %>%
  filter(align != "Reformed Criminals") %>%
  droplevels()
# Convert character strings to factors
comics <- comics %>%
  mutate(name = as.factor(name),
         id = factor(id),
         align = factor(align, 
                           levels = c("Bad", "Neutral", "Good")), # sets order
         eye = factor(eye),
         hair = factor(hair),
         gender = factor(gender),
         alive = factor(alive),
         first_appear = factor(first_appear),
         publisher = factor(publisher)
         )

# Levels of align, gender
levels(comics$align)
## [1] "Bad"     "Neutral" "Good"
levels(comics$gender)
## [1] "Female" "Male"   "Other"
# Table of counts (see prop.table() below for proportions)
table(comics$align)
## 
##     Bad Neutral    Good 
##    9615    2773    7468
#margin.table(comics$align)

# Contingency table or proportional table
table(comics$align, comics$gender)
##          
##           Female Male Other
##   Bad       1573 7561    32
##   Neutral    836 1799    17
##   Good      2490 4809    17
options(scipen = 999, digits = 3) # sig digits
prop.table(table(comics$align, comics$gender))
##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Neutral 0.043692 0.094021 0.000888
##   Good    0.130135 0.251333 0.000888
# Conditional proportion.  Condition on rows (margin = 1), or cols (margin = 2).
prop.table(table(comics$align, comics$gender), margin = 1)
##          
##            Female    Male   Other
##   Bad     0.17161 0.82490 0.00349
##   Neutral 0.31523 0.67836 0.00641
##   Good    0.34035 0.65733 0.00232
# Marginal bar-plot of gender counts
ggplot(comics, aes(x = gender)) +
  geom_bar()

# Improve by ordering
comics$gender = factor(comics$gender,
                       levels = c("Male", "Female", "Other"))
ggplot(comics, aes(x = gender)) +
  geom_bar()

# Conditional stacked bar-plot of align counts conditioned on gender
ggplot(comics, aes(x = gender, fill = align)) +
  geom_bar()

# Conditional stacked bar-chart of align proportions conditioned on gender
ggplot(comics, aes(x = gender, fill = align)) +
  geom_bar(position = "fill") +
  ylab("proportion")

# Conditional faceted bar-chart of align counts conditioned on gender
ggplot(comics, aes(x = align)) +
  geom_bar() +
  facet_wrap(~ gender)

# Same, but as side-by-side barchart
ggplot(comics, aes(x = gender, fill = align)) + 
  geom_bar(position = "dodge")

# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + 
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90)) # vertical x-axis labels

Exploring Numerical Data

For univariate exploration, visualize discrete numerical data with a limited number of distinct values using geom_dotplot(). Otherwise, use geom_histogram() or a density plot with geom_density(). geom_boxplot() is also possible, but it is usually reserved for bivariate exploration.

library(readr)
library(ggplot2)
cars <- read_csv("https://assets.datacamp.com/production/course_1796/datasets/cars04.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   sports_car = col_logical(),
##   suv = col_logical(),
##   wagon = col_logical(),
##   minivan = col_logical(),
##   pickup = col_logical(),
##   all_wheel = col_logical(),
##   rear_wheel = col_logical(),
##   msrp = col_integer(),
##   dealer_cost = col_integer(),
##   eng_size = col_double(),
##   ncyl = col_integer(),
##   horsepwr = col_integer(),
##   city_mpg = col_integer(),
##   hwy_mpg = col_integer(),
##   weight = col_integer(),
##   wheel_base = col_integer(),
##   length = col_integer(),
##   width = col_integer()
## )
# Dotplot for discrete numerical variable with limited distinct values
ggplot(cars, aes(x = weight)) +
  geom_dotplot(dotsize = 0.4) +
  labs(title = "Car Weight Distribution",
       subtitle = "Dot plots can get unweieldy, but show the most detail.")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bindot).

# Histogram for any numerical data with many distinct values
ggplot(cars, aes(x = weight)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).

For bivariate analysis of a numerical variable against a factor variable, use geom_boxplot() or overlaid geom_density() plots. Both density plots and box plots display the central tendency and spread of the data. The box plot is more robust to outliers; the density plot reveals multi-modal distributions.

Add a third dimension to plots with facet_grid(rowvar ~ colvar) or mapping to shape, color, size, pattern, movement, x-coord, or y-coord.

library(readr)
library(dplyr)
library(ggplot2)
cars <- read_csv("https://assets.datacamp.com/production/course_1796/datasets/cars04.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   sports_car = col_logical(),
##   suv = col_logical(),
##   wagon = col_logical(),
##   minivan = col_logical(),
##   pickup = col_logical(),
##   all_wheel = col_logical(),
##   rear_wheel = col_logical(),
##   msrp = col_integer(),
##   dealer_cost = col_integer(),
##   eng_size = col_double(),
##   ncyl = col_integer(),
##   horsepwr = col_integer(),
##   city_mpg = col_integer(),
##   hwy_mpg = col_integer(),
##   weight = col_integer(),
##   wheel_base = col_integer(),
##   length = col_integer(),
##   width = col_integer()
## )
# Box plots of city mpg by ncyl.
# (note: set aes(x = 1) to create a single boxplot)
ggplot(cars, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

# Overlaid density plots for same data
ggplot(cars, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)
## Warning: Removed 14 rows containing non-finite values (stat_density).
## Warning: Groups with fewer than two data points have been dropped.

# Use piping with a filter to single out interesting subsets of numerical data.
cars %>% 
  filter(msrp < 25000) %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram(binwidth = 3) +
  xlim(c(90, 550)) +
  labs(title = "Distribution of horsepower for cars under $25k (bw = 3)")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

# In this case, the box plot works better because of the wide range of outliers
cars %>%
  ggplot(aes(x = city_mpg)) +
  geom_density()
## Warning: Removed 14 rows containing non-finite values (stat_density).

cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

# In this case, the density plot works better because of the three modes.
cars %>% 
  ggplot(aes(x = width)) +
  geom_density()
## Warning: Removed 28 rows containing non-finite values (stat_density).

cars %>%
  ggplot(aes(x = 1, y = width)) +
  geom_boxplot()
## Warning: Removed 28 rows containing non-finite values (stat_boxplot).

# Facet hists using hwy mileage and ncyl
cars %>%
  filter(ncyl %in% c(2, 4, 6)) %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv, labeller = label_both) +
  ggtitle("hwy_mpg distribution, ncyl vs suv")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7 rows containing non-finite values (stat_bin).

Numerical Summaries

Measures of center include the mean, median, and mode. Measures of variability include var(x), sd(x), IQR(x), and diff(range(x)). The IQR is useful for data sets that are heavily skewed or have extreme observations: pair the mean with the sd, and the median with the IQR. Measures of shape include modality (uniform, unimodal, bimodal, multimodal) and skew (symmetric, right-skewed, left-skewed). Highly skewed distributions can make it very difficult to learn anything from a visualization; transformations can help reveal the more subtle structure. Transform skewed data with the natural logarithm function log().

library(readr)
library(dplyr)
library(ggplot2)

life <- read_csv("https://assets.datacamp.com/production/course_1796/datasets/life_exp_raw.csv")
## Parsed with column specification:
## cols(
##   State = col_character(),
##   County = col_character(),
##   fips = col_integer(),
##   Year = col_integer(),
##   `Female life expectancy (years)` = col_double(),
##   `Female life expectancy (national, years)` = col_double(),
##   `Female life expectancy (state, years)` = col_double(),
##   `Male life expectancy (years)` = col_double(),
##   `Male life expectancy (national, years)` = col_double(),
##   `Male life expectancy (state, years)` = col_double()
## )
life$expectancy_f <- life$`Female life expectancy (years)`
life$`Female life expectancy (years)` <- NULL
life$expect_f_nat <- life$`Female life expectancy (national, years)`
life$'Female life expectancy (national, years)' <- NULL
life$expect_f_st <- life$`Female life expectancy (state, years)`
life$'Female life expectancy (state, years)' <- NULL
life$expect_m <- life$`Male life expectancy (years)`
life$'Male life expectancy (years)' <- NULL
life$expect_m_nat <- life$`Male life expectancy (national, years)`
life$'Male life expectancy (national, years)' <- NULL
life$expect_m_st <- life$`Male life expectancy (state, years)`
life$'Male life expectancy (state, years)' <- NULL

# Do west-coast states have a higher life expectancy?

life <- life %>%
  mutate(west_coast = State %in% c("California", "Oregon", "Washington"))
life %>%
  group_by(west_coast) %>%
  summarize(mean(expect_m), 
            median(expect_m))
## # A tibble: 2 x 3
##   west_coast `mean(expect_m)` `median(expect_m)`
##   <lgl>                 <dbl>              <dbl>
## 1 FALSE                  72.6               72.8
## 2 TRUE                   74.4               74.3
ggplot(life, aes(x = west_coast, y = expect_m)) +
  geom_boxplot() +
  labs(title = "Male Life Expectancy Distribution",
       x = "West Coast State")

ggplot(life, aes(x = expect_m, fill = west_coast)) +
  geom_density(alpha = 0.3) +
  labs(title = "Male Life Expectancy Distribution")

Case Study

What attributes of an email are associated with spam?

library(openintro)
library(ggplot2)
library(dplyr)

email2 <- email %>%
  mutate(spam = factor(ifelse(spam == 1, "spam", "not-spam")))

# Is it size?
email2 %>%
  group_by(spam) %>%
  summarize(
#  mean(num_char),
#    sd(num_char),
    median(num_char),
    IQR(num_char)
    )
## # A tibble: 2 x 3
##   spam     `median(num_char)` `IQR(num_char)`
##   <fct>                 <dbl>           <dbl>
## 1 not-spam               6.83           13.6 
## 2 spam                   1.05            2.82
email2 %>%
  mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = spam, y = log_num_char)) +
    geom_boxplot()

# Conclusions:
# The typical spam email is considerably shorter, but there is still a lot of overlap.

# Is it the number of exclamation marks?
# If the distribution is not unimodal, use a faceted histogram or density plot; otherwise, use a boxplot.  Appropriate summary stats to pair with a boxplot are the median and IQR; use the mean and sd otherwise.
email2 %>%
  group_by(spam) %>%
  summarize(median(exclaim_mess),
    IQR(exclaim_mess),
    mean(exclaim_mess),
    sd(exclaim_mess))
## # A tibble: 2 x 5
##   spam     `median(exclaim_mess)` `IQR(exclaim_mess)` `mean(exclaim_mess)`
##   <fct>                     <dbl>               <dbl>                <dbl>
## 1 not-spam                     1.                  5.                 6.51
## 2 spam                         0.                  1.                 7.32
## # ... with 1 more variable: `sd(exclaim_mess)` <dbl>
# Create plot for spam and exclaim_mess
ggplot(email2, aes(x = spam, y = log(exclaim_mess + 0.1))) + 
  geom_boxplot()

ggplot(email2, aes(x = log(exclaim_mess+.01), fill = spam)) +
  geom_density(alpha = 0.6)

# Histogram seems most helpful
email2 %>%
  mutate(log_exclaim_mess = log(exclaim_mess+0.01)) %>%
  ggplot(aes(x = log_exclaim_mess)) +
    geom_histogram() +
    facet_wrap(~ spam)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Conclusions:
# Most common is 0 or 1 exclamation marks in both classes of email.
# Even after transformation, the distribution is right-skewed in both classes of email.
# The typical number of exclamations in the not-spam group appears to be slightly higher than in the spam group.

# What to do when there are so many emails with zero exclamation marks?  One strategy is to analyze the two groups separately.  Another is to collapse them into a two-level categorical variable, as sketched below.
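# Sketch of the second strategy (the indicator name has_exclaim is ours):
# collapse exclaim_mess into a two-level variable, then compare the spam mix in each group.
email2 %>%
  mutate(has_exclaim = exclaim_mess > 0) %>%
  ggplot(aes(x = has_exclaim, fill = spam)) +
  geom_bar(position = "fill")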

# How about number of images?  There are 3,811 instances of no images and just a few with >= 1.  Collapse the image variable into a logical.
table(email2$image)
## 
##    0    1    2    3    4    5    9   20 
## 3811   76   17   11    2    2    1    1
email2 %>%
  mutate(has_image = (image > 0)) %>%
  ggplot(aes(x = has_image, fill = spam)) +
  geom_bar(position = "fill")

# Check data integrity.
# There should be no instances of more images than attachments if images are a form of attachment.
sum(email$image > email$attach)
## [1] 0

Exploratory Data Analysis Case Study

UN Voting Dataset.

library(dplyr)
library(broom)
library(tidyr)
library(purrr)
#library(countrycode)

votes <- readRDS("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Data/votes.rds")
glimpse(votes)
## Observations: 508,929
## Variables: 4
## $ rcid    <dbl> 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46...
## $ session <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ vote    <dbl> 1, 1, 9, 1, 1, 1, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 9, 1, ...
## $ ccode   <int> 2, 20, 31, 40, 41, 42, 51, 52, 53, 54, 55, 56, 57, 58,...
descriptions <- readRDS("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Data/descriptions.rds")

# There are six columns in the descriptions dataset that describe the topic of a resolution:
#me: Palestinian conflict
#nu: Nuclear weapons and nuclear material
#di: Arms control and disarmament
#hr: Human rights
#co: Colonialism
#ec: Economic development

# The vote column represents the country's vote:
# 1 = Yes
# 2 = Abstain
# 3 = No
# 8 = Not present
# 9 = Not a member
# Get rid of  "Not present" and "Not a member".
# The first session was 1946, so create column year = session + 1945.
votes_processed <- votes %>%
  filter(vote %in% c(1, 2, 3)) %>%
  mutate(year = session + 1945)

# Join data sets
votes_joined <- inner_join(votes_processed, descriptions, c("rcid", "session"))
## Warning: Column `rcid` has different attributes on LHS and RHS of join
## Warning: Column `session` has different attributes on LHS and RHS of join
votes_gathered <- votes_joined %>%
  gather(key = topic, value = has_topic, c(me:ec)) %>%
  filter(has_topic != 0)
votes_tidied <- votes_gathered %>%
  mutate(topic = recode(topic,
                        "me" = "Palestinian conflict",
                        "nu" = "Nuclear weapons and nuclear material",
                        "di" = "Arms control and disarmament",
                        "hr" = "Human rights",
                        "co" = "Colonialism",
                        "ec" = "Economic development"))

by_country_year_topic <- votes_tidied %>%
  group_by(ccode, year, topic) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1)) %>%
  ungroup()

US_by_country_year_topic <- by_country_year_topic %>%
#  filter(country == "United States")
  filter(ccode == 2)  # COW country code 2 = United States

# Plot % yes over time for the US, faceting by topic
ggplot(US_by_country_year_topic, aes(x = year, y = percent_yes)) +
  geom_line() +
  facet_wrap(~ topic)

# Create a model by country:
# - nest all columns other than the keys (ccode, topic) into list-column tibbles
# - map an lm model over each nested tibble
# - use broom's tidy() to turn each model summary into a tibble
# - unnest the tidied results back into the main data frame
country_topic_coefficients <- by_country_year_topic %>%
  nest(-ccode, -topic) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied)
## Warning in summary.lm(x): essentially perfect fit: summary may be
## unreliable

(The same warning repeats for several more countries.)
country_topic_filtered <- country_topic_coefficients %>%
  filter(term == "year") %>%
  mutate(p.adjusted = p.adjust(p.value)) %>%
  filter(p.adjusted < 0.05)

vanuatu_by_country_year_topic <- by_country_year_topic %>%
  filter(ccode == 20)

# Plot of percentage "yes" over time, faceted by topic
ggplot(vanuatu_by_country_year_topic, aes(x = year, y = percent_yes)) +
  geom_line() +
  facet_wrap(~ topic)

US_co_by_year <- votes_joined %>%
  filter(ccode == 2, co == 1) %>%
  group_by(year) %>%
  summarize(percent_yes = mean(vote == 1))

# Graph the % of "yes" votes over time
ggplot(US_co_by_year, aes(x = year, y = percent_yes)) +
  geom_line()

by_year_country <- votes_joined %>%
  group_by(ccode, year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))


US_by_year <- by_year_country %>%
  filter(ccode == 2)
US_fit <- lm(percent_yes ~ year, US_by_year)

# Fit model for the United Kingdom
UK_by_year <- by_year_country %>%
  filter(ccode == 20)
UK_fit <- lm(percent_yes ~ year, UK_by_year)

US_tidied <- tidy(US_fit)
UK_tidied <- tidy(UK_fit)

# Combine the two tidied models
bind_rows(US_tidied, UK_tidied)
##          term estimate std.error statistic      p.value
## 1 (Intercept) 12.66415  1.837974      6.89 0.0000000848
## 2        year -0.00624  0.000928     -6.72 0.0000001367
## 3 (Intercept) -2.48418  1.891412     -1.31 0.1983885448
## 4        year  0.00152  0.000955      1.59 0.1223589541

Right now, the by_year_country data frame has one row per country-year pair. So that you can model each country individually, you’re going to “nest” all columns besides the country code (ccode), which will result in a data frame with one row per country. The data for each individual country will then be stored in a list column called data.

library(tidyr)
library(purrr)

country_coefficients <- by_year_country %>%
  nest(-ccode) %>%
  mutate(models = map(data, ~ lm(percent_yes ~ year, .))) %>%
  mutate(tidied = map(models, tidy)) %>%
  unnest(tidied)

# When you have lots of p-values, like one for each country, you run into the problem of multiple hypothesis testing and need a stricter threshold. The p.adjust() function is a simple way to correct for this: p.adjust(p.value) on a vector of p-values returns adjusted p-values that can be compared against the usual significance level.
country_coefficients %>%
  filter(term == "year") %>%
  filter(p.adjust(p.value) < 0.05)
## # A tibble: 61 x 6
##    ccode term  estimate std.error statistic  p.value
##    <int> <chr>    <dbl>     <dbl>     <dbl>    <dbl>
##  1     2 year  -0.00624  0.000928     -6.72 1.37e- 7
##  2    40 year   0.00461  0.000721      6.40 3.43e- 7
##  3    41 year   0.00538  0.000699      7.70 8.82e- 9
##  4    42 year   0.00806  0.000914      8.81 5.96e-10
##  5    70 year   0.00530  0.000884      6.00 1.08e- 6
##  6    90 year   0.00585  0.00104       5.62 3.27e- 6
##  7    91 year   0.00772  0.000921      8.38 1.43e- 9
##  8    92 year   0.00614  0.000851      7.22 3.38e- 8
##  9    93 year   0.00708  0.00107       6.60 1.92e- 7
## 10    94 year   0.00654  0.000812      8.05 3.39e- 9
## # ... with 51 more rows


Foundations of Inference

Statistical inference is the process of making claims about a population based on information from a sample. Inferential statistics attempts to reject a null hypothesis, \(H_0\).

The logic of statistical inference is to compare an observed statistic to the distribution of statistics arising from the null distribution.

Here is a brief exploration of the data used in this section.

library(NHANES)
## Warning: package 'NHANES' was built under R version 3.4.4
library(ggplot2)
library(dplyr)

glimpse(NHANES)
## Observations: 10,000
## Variables: 76
## $ ID               <int> 51624, 51624, 51624, 51625, 51630, 51638, 516...
## $ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, ...
## $ Gender           <fct> male, male, male, male, female, male, male, f...
## $ Age              <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, ...
## $ AgeDecade        <fct>  30-39,  30-39,  30-39,  0-9,  40-49,  0-9,  ...
## $ AgeMonths        <int> 409, 409, 409, 49, 596, 115, 101, 541, 541, 5...
## $ Race1            <fct> White, White, White, Other, White, White, Whi...
## $ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Education        <fct> High School, High School, High School, NA, So...
## $ MaritalStatus    <fct> Married, Married, Married, NA, LivePartner, N...
## $ HHIncome         <fct> 25000-34999, 25000-34999, 25000-34999, 20000-...
## $ HHIncomeMid      <int> 30000, 30000, 30000, 22500, 40000, 87500, 600...
## $ Poverty          <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33, 5.0...
## $ HomeRooms        <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 10, 1...
## $ HomeOwn          <fct> Own, Own, Own, Own, Rent, Rent, Own, Own, Own...
## $ Work             <fct> NotWorking, NotWorking, NotWorking, NA, NotWo...
## $ Weight           <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2, 75....
## $ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Height           <dbl> 165, 165, 165, 105, 168, 133, 131, 167, 167, ...
## $ BMI              <dbl> 32.2, 32.2, 32.2, 15.3, 30.6, 16.8, 20.6, 27....
## $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ BMI_WHO          <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.5, 3...
## $ Pulse            <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 60, 6...
## $ BPSysAve         <int> 113, 113, 113, NA, 112, 86, 107, 118, 118, 11...
## $ BPDiaAve         <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 63, 7...
## $ BPSys1           <int> 114, 114, 114, NA, 118, 84, 114, 106, 106, 10...
## $ BPDia1           <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 64, 7...
## $ BPSys2           <int> 114, 114, 114, NA, 108, 84, 108, 118, 118, 11...
## $ BPDia2           <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 62, 7...
## $ BPSys3           <int> 112, 112, 112, NA, 116, 88, 106, 118, 118, 11...
## $ BPDia3           <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 64, 7...
## $ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ DirectChol       <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2.12,...
## $ TotChol          <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5.82,...
## $ UrineVol1        <int> 352, 352, 352, NA, 77, 123, 238, 106, 106, 10...
## $ UrineFlow1       <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.116, 1...
## $ UrineVol2        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ UrineFlow2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Diabetes         <fct> No, No, No, No, No, No, No, No, No, No, No, N...
## $ DiabetesAge      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ HealthGen        <fct> Good, Good, Good, NA, Good, NA, NA, Vgood, Vg...
## $ DaysPhysHlthBad  <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4, NA...
## $ DaysMentHlthBad  <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0, 0,...
## $ LittleInterest   <fct> Most, Most, Most, NA, Several, NA, NA, None, ...
## $ Depressed        <fct> Several, Several, Several, NA, Several, NA, N...
## $ nPregnancies     <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, NA, N...
## $ nBabies          <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA...
## $ Age1stBaby       <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, N...
## $ SleepHrsNight    <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA,...
## $ SleepTrouble     <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, N...
## $ PhysActive       <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes, Ye...
## $ PhysActiveDays   <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5, 1,...
## $ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ TVHrsDayChild    <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, NA, ...
## $ CompHrsDayChild  <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, NA, ...
## $ Alcohol12PlusYr  <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes...
## $ AlcoholDay       <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2, 6, ...
## $ AlcoholYear      <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100, 104...
## $ SmokeNow         <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, No, ...
## $ Smoke100         <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, Y...
## $ Smoke100n        <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, NA, N...
## $ SmokeAge         <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 13, N...
## $ Marijuana        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes...
## $ AgeFirstMarij    <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, NA, 1...
## $ RegularMarij     <fct> No, No, No, NA, No, NA, NA, No, No, No, NA, Y...
## $ AgeRegMarij      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2...
## $ HardDrugs        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, N...
## $ SexEver          <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes...
## $ SexAge           <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 17, 2...
## $ SexNumPartnLife  <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, 7, 1...
## $ SexNumPartYear   <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1, NA...
## $ SameSex          <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Yes, N...
## $ SexOrientation   <fct> Heterosexual, Heterosexual, Heterosexual, NA,...
## $ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
# Bar plot of Home Ownership by Gender
ggplot(NHANES, aes(x = Gender, fill = HomeOwn)) + 
  geom_bar(position = "fill") +
  ylab("Relative frequencies")

# Density plot of SleepHrsNight colored by SleepTrouble
ggplot(NHANES, aes(x = SleepHrsNight, color = SleepTrouble)) + 
  geom_density(adjust = 2) + 
  facet_wrap(~ HealthGen)
## Warning: Removed 2245 rows containing non-finite values (stat_density).

Calculate the difference in home-ownership proportions for males and females. Our statistic, the male proportion minus the female proportion, is -0.0078.

library(infer)
## Warning: package 'infer' was built under R version 3.4.4
homes <- NHANES %>%
  select(Gender, HomeOwn) %>%
  filter(HomeOwn %in% c("Own", "Rent"))

diff_orig <- homes %>%   
  group_by(Gender) %>%
  summarize(prop_own = mean(HomeOwn == "Own")) %>%
  summarize(obs_diff_prop = diff(prop_own)) # male - female
diff_orig
## # A tibble: 1 x 1
##   obs_diff_prop
##           <dbl>
## 1      -0.00783

Model natural variability (the null distribution) by shuffling observations to remove any relationships that might exist in the population.

Use the infer package to model \(H_0\) of no relationship between HomeOwn and Gender. Randomize the data to calculate permuted statistics.
1. Specify the model with specify, defining the success condition for the proportion.
2. Set the null hypothesis with hypothesize.
3. Generate reps permutations of the data with generate.
4. Calculate summary statistics with calculate.

This process ensures that there is no relationship between home ownership and gender, so any difference in home ownership proportions is due only to natural variability.

homeown_perm <- homes %>%
  specify(HomeOwn ~ Gender, success = "Own") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "diff in props", 
            order = c("male", "female"))

ggplot(homeown_perm, aes(x = stat)) + 
#  geom_dotplot(binwidth = 0.001) + 
  geom_density()

The observed difference of -0.0078 falls below the bulk of the density of shuffled differences. How many randomly permuted differences were as extreme as the observed difference?

homeown_perm <- homeown_perm %>%
  mutate(diff_perm = stat) %>%
  mutate(stat = NULL)
homeown_perm$diff_orig <- rep(diff_orig[[1,1]], nrow(homeown_perm))

# Plot permuted differences, diff_perm
ggplot(homeown_perm, aes(x = diff_perm)) + 
  geom_density() +
  geom_vline(aes(xintercept = diff_orig), color = "red")

homeown_perm %>%
  summarize(sum(diff_perm <= diff_orig))
## # A tibble: 1 x 1
##   `sum(diff_perm <= diff_orig)`
##                           <int>
## 1                           220

220 of the 1,000 permutations produced a difference in proportions at least as extreme as the observed value, so do not reject \(H_0\). Our data are consistent with the hypothesis of no difference in home ownership across gender.

Complete Case Study

Consider the study on gender discrimination. Our hypotheses are: \(H_0\): gender and promotion are unrelated variables. \(H_A\): men are more likely to be promoted.

Start with exploratory analysis

library(dplyr)
library(infer)

disc <- readRDS("Data/disc_new.rds")
glimpse(disc)
## Observations: 48
## Variables: 2
## $ promote <fct> promoted, promoted, promoted, promoted, promoted, prom...
## $ sex     <fct> male, male, male, male, male, male, male, male, male, ...
# Counts and proportions (as shown in course)
#disc %>%
#  count(promote, sex)
#disc %>%
#  group_by(sex) %>%
#  summarize(promoted_prop = mean(promote == "promoted"))

# Better way: Contingency table
table(disc$promote, disc$sex)
##               
##                female male
##   not_promoted      7    6
##   promoted         17   18
options(scipen = 999, digits = 3) # sig digits
# marginal proportion (margin = 2 for cols)
prop.table(table(disc$promote, disc$sex), margin = 2)
##               
##                female  male
##   not_promoted  0.292 0.250
##   promoted      0.708 0.750
# Calculate difference in proportions, male - female
diff_orig <- disc %>%
  group_by(sex) %>%
  summarize(prop_prom = mean(promote == "promoted")) %>%
  summarize(stat = diff(prop_prom)) %>%
  pull()

Perform the permutation. To quantify the extreme permuted (null) differences, use the quantile() function. The p-value is the probability of observing data at least as extreme as the observed data, given that the null hypothesis is true.

library(ggplot2)

# Replicate the data frame, permuting the promote variable
disc_perm <- disc %>%
  specify(promote ~ sex, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("male", "female"))

ggplot(disc_perm, aes(x = stat)) + 
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = diff_orig), color = "red")

disc_perm %>% 
  summarize(
    q.01 = quantile(stat, p = 0.01),
    q.05 = quantile(stat, p = 0.05),
    q.10 = quantile(stat, p = 0.10),
    q.90 = quantile(stat, p = 0.90),
    q.95 = quantile(stat, p = 0.95),
    q.99 = quantile(stat, p = 0.99)
  )
## # A tibble: 1 x 6
##     q.01   q.05   q.10  q.90  q.95  q.99
##    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 -0.292 -0.208 -0.125 0.208 0.208 0.292
disc_perm %>%
  visualize(obs_stat = diff_orig, direction = "greater")
## Warning: `visualize()` shouldn't be used to plot p-value. Arguments
## `obs_stat`, `obs_stat_color`, `pvalue_fill`, and `direction` are
## deprecated. Use `shade_p_value()` instead.

disc_perm %>%
  get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.518
disc_perm %>%
  summarize(p_value = mean(diff_orig <= stat))
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.518

Inference for Categorical Data

Question: How much confidence in the scientific community did people have in 2016? The answers to this question have been summarized as “High” or “Low” levels of confidence and are stored in the consci variable.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.4
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v tibble  1.4.2     v forcats 0.3.0
## Warning: package 'forcats' was built under R version 3.4.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x dplyr::between()         masks data.table::between()
## x dplyr::combine()         masks gdata::combine()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x dplyr::first()           masks gdata::first(), data.table::first()
## x jsonlite::flatten()      masks purrr::flatten()
## x lubridate::hour()        masks data.table::hour()
## x lubridate::intersect()   masks base::intersect()
## x lubridate::isoweek()     masks data.table::isoweek()
## x gdata::keep()            masks purrr::keep()
## x dplyr::lag()             masks stats::lag()
## x dplyr::last()            masks gdata::last(), data.table::last()
## x lubridate::mday()        masks data.table::mday()
## x lubridate::minute()      masks data.table::minute()
## x lubridate::month()       masks data.table::month()
## x lubridate::quarter()     masks data.table::quarter()
## x lubridate::second()      masks data.table::second()
## x lubridate::setdiff()     masks base::setdiff()
## x data.table::transpose()  masks purrr::transpose()
## x lubridate::union()       masks base::union()
## x lubridate::wday()        masks data.table::wday()
## x lubridate::week()        masks data.table::week()
## x lubridate::yday()        masks data.table::yday()
## x lubridate::year()        masks data.table::year()
library(dplyr)
library(ggplot2)

load("Data/gss.RData")
glimpse(gss)
## Observations: 50,346
## Variables: 28
## $ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ year     <dbl> 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1982,...
## $ age      <fct> 41, 49, 27, 24, 57, 29, 21, 68, 54, 80, 74, 30, 53, 3...
## $ class    <fct> WORKING CLASS, WORKING CLASS, MIDDLE CLASS, MIDDLE CL...
## $ degree   <fct> LT HIGH SCHOOL, HIGH SCHOOL, HIGH SCHOOL, HIGH SCHOOL...
## $ sex      <fct> MALE, FEMALE, FEMALE, FEMALE, MALE, MALE, FEMALE, MAL...
## $ marital  <fct> MARRIED, MARRIED, NEVER MARRIED, NEVER MARRIED, NEVER...
## $ race     <fct> WHITE, WHITE, WHITE, WHITE, WHITE, WHITE, WHITE, WHIT...
## $ region   <fct> NEW ENGLAND, NEW ENGLAND, NEW ENGLAND, NEW ENGLAND, N...
## $ partyid  <fct> STRONG DEMOCRAT, STRONG DEMOCRAT, IND,NEAR DEM, IND,N...
## $ happy    <fct> PRETTY HAPPY, NOT TOO HAPPY, VERY HAPPY, PRETTY HAPPY...
## $ grass    <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ relig    <fct> CATHOLIC, CATHOLIC, CATHOLIC, CATHOLIC, CATHOLIC, CAT...
## $ cappun2  <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ cappun   <fct> FAVOR, FAVOR, FAVOR, OPPOSE, OPPOSE, FAVOR, OPPOSE, F...
## $ finalter <fct> STAYED SAME, WORSE, BETTER, BETTER, STAYED SAME, BETT...
## $ protest3 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ natspac  <fct> ABOUT RIGHT, TOO MUCH, TOO LITTLE, TOO LITTLE, ABOUT ...
## $ natarms  <fct> TOO LITTLE, TOO LITTLE, ABOUT RIGHT, TOO MUCH, TOO LI...
## $ conclerg <fct> ONLY SOME, ONLY SOME, A GREAT DEAL, ONLY SOME, A GREA...
## $ confed   <fct> ONLY SOME, ONLY SOME, ONLY SOME, ONLY SOME, A GREAT D...
## $ conpress <fct> ONLY SOME, ONLY SOME, A GREAT DEAL, ONLY SOME, A GREA...
## $ conjudge <fct> HARDLY ANY, ONLY SOME, A GREAT DEAL, A GREAT DEAL, A ...
## $ consci   <fct> ONLY SOME, ONLY SOME, A GREAT DEAL, A GREAT DEAL, A G...
## $ conlegis <fct> ONLY SOME, ONLY SOME, ONLY SOME, ONLY SOME, A GREAT D...
## $ zodiac   <fct> TAURUS, CAPRICORN, VIRGO, PISCES, CAPRICORN, LEO, LIB...
## $ oversamp <dbl> 1.24, 1.24, 1.24, 1.24, 1.24, 1.24, 1.24, 1.24, 1.24,...
## $ postlife <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
# Collapse levels 
# from "A GREAT DEAL" "ONLY SOME"    "HARDLY ANY"
# to "High", "Low"
levels(gss$consci)
## [1] "A GREAT DEAL" "ONLY SOME"    "HARDLY ANY"
levels(gss$consci) <- c("High", "Low", "Low")

gss2016 <- gss %>% 
  filter(year == 2016)

ggplot(gss2016, aes(x = consci)) +
  geom_bar()

# proportion of high conf.
p_hat <- gss2016 %>%
  summarize(p = mean(consci == "High", na.rm = TRUE))

Calculate the standard error. To assess the uncertainty in this estimate of the proportion of people that have “High” confidence in the scientific community, start by considering how different the data might look in just a single bootstrap sample.

library(infer)

# Create single bootstrap data set
b1 <- gss2016 %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 1, type = "bootstrap")
## Warning: Removed 983 rows containing missing values.
# Plot distribution of consci
ggplot(b1, aes(x = consci)) +
  geom_bar()

# Compute proportion with high conf
b1 %>%
  summarize(p = mean(consci == "High"))
## # A tibble: 1 x 2
##   replicate     p
##       <int> <dbl>
## 1         1 0.439
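
To get the standard error itself, repeat the bootstrap many times and take the standard deviation of the resulting proportions (a minimal sketch; reps = 500 is an arbitrary choice):

boot_dist <- gss2016 %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop")

# Bootstrap standard error and an approximate 95% interval around p_hat.
boot_dist %>%
  summarize(se = sd(stat))
p_hat$p + c(-2, 2) * sd(boot_dist$stat)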

Correlation and Regression

Visualize bi-variate continuous relationships with a scatterplot. Or, discretize one variable and create boxplots. The goal of bivariate analysis is to characterize the form (linear, quadratic, nonlinear), direction (positive or negative), strength (scatter), and outliers.

library(openintro)
library(dplyr)
library(ggplot2)

data(ncbirths)
ggplot(ncbirths, aes(x = weeks, y = weight)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).

ggplot(data = ncbirths, 
       aes(x = cut(weeks, breaks = 5), y = weight)) + 
  geom_boxplot()

Use a data transformation to reveal a linear relationship when the raw data do not show one. When values span a wide range, transform with a log.

# Original
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point()  

# Scatterplot with coord_trans()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() + 
  coord_trans(x = "log10", y = "log10")

# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() +
  scale_x_log10() + scale_y_log10()

Use alpha shading and/or jittering to address overplotting of integer variables. Identify outliers and note how the relationship between two variables may change as a result of removing them. Be careful with rate statistics since they mask the underlying number of observations.

Correlation

Correlation is a numeric measure of the degree of linear relationship between two variables. It is defined as

\(r(x,y) = \frac{Cov(x,y)}{\sqrt{S_{XX} \cdot S_{YY}}}\)

The cor(x, y) function computes the Pearson product-moment correlation between variables x and y. Specify use = "pairwise.complete.obs" when the columns may include missing data.

ncbirths %>%
  summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))
##      N    r
## 1 1000 0.67
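
To connect the formula to the function, the same value can be computed from its definition on the complete cases; a sketch:

# Pearson r by hand: covariance divided by the product of standard deviations
cc <- ncbirths[complete.cases(ncbirths[, c("weight", "weeks")]), ]
cov(cc$weight, cc$weeks) / sqrt(var(cc$weight) * var(cc$weeks))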

Simple Linear Regression

The simple linear regression model minimizes the sum of squared residuals. Linear regression is a specific example of a larger class of smooth models. The geom_smooth() function allows you to draw such models over a scatterplot of the data itself. This technique is known as visualizing the model in the data space. The method argument to geom_smooth() allows you to specify what class of smooth model you want to see. Since we are exploring linear models, we’ll set this argument to the value "lm".

ggplot(data = bdims, aes(x = wgt, y = hgt)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

The lm() function creates an object of class lm. There are functions to extract everything about the model, including fitted.values() and residuals(). The broom package can convert the lm object into a tidy data frame with the augment() function.

library(broom)

mod <- lm(hgt ~ wgt, data = bdims)
coef(mod)
## (Intercept)         wgt 
##     136.182       0.506
summary(mod)
## 
## Call:
## lm(formula = hgt ~ wgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.716  -3.878   0.008   4.653  18.688 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 136.1819     1.5391    88.5 <0.0000000000000002 ***
## wgt           0.5056     0.0219    23.1 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.56 on 505 degrees of freedom
## Multiple R-squared:  0.515,  Adjusted R-squared:  0.514 
## F-statistic:  535 on 1 and 505 DF,  p-value: <0.0000000000000002
mod_tidy <- augment(mod)
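
broom also provides tidy() and glance() for coefficient-level and model-level summaries; a quick sketch:

tidy(mod)    # coefficient table as a tidy data frame
glance(mod)  # one-row model summary (R-squared, sigma, F statistic, ...)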

Make predictions by pairing the lm model with a new data frame. These types of predictions are called out-of-sample.

mod <- lm(wgt ~ hgt, data = bdims)
ben <- data.frame(wgt = c(74.8), hgt = c(182.8))
predict(mod, newdata = ben)
##  1 
## 81
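
predict() can also return an uncertainty interval around the point estimate; a minimal sketch (the interval type and level are choices, not part of the original example):

# 95% prediction interval for the same new observation
predict(mod, newdata = ben, interval = "prediction", level = 0.95)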

The geom_smooth() function makes it easy to add a simple linear regression line to a scatterplot of the corresponding variables. And in fact, there are more complicated regression models that can be visualized in the data space with geom_smooth(). However, there may still be times when we will want to add regression lines to our scatterplot manually. To do this, we will use the geom_abline() function, which takes slope and intercept arguments.

mod <- lm(wgt ~ hgt, data = bdims)
coefs <- as.data.frame(coef(mod))
ggplot(data = bdims, aes(x = hgt, y = wgt)) + 
  geom_point() + 
  geom_abline(aes(intercept = coef(mod)["(Intercept)"], slope = coef(mod)["hgt"]),  
              color = "dodgerblue")

The root mean squared error (RMSE), a.k.a. the residual standard error, is defined as the square root of the sum of squared residuals divided by the degrees of freedom. It roughly measures how far the predicted values are from the observed values, expressed in the same units as the observed values.

\(R^2 = 1 - SSE/SST\), where SST is the sum of squared errors of the null model \(y \sim 1\) (whose fitted value is \(\bar y\)). Equivalently, \(R^2 = 1 - var(resid) / var(y)\).
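
Both quantities can be checked by hand against summary(mod); a sketch using the current model (wgt ~ hgt):

# RMSE: the residual standard error reported by summary(mod)
sqrt(sum(residuals(mod)^2) / df.residual(mod))

# R-squared two equivalent ways
1 - sum(residuals(mod)^2) / sum((bdims$wgt - mean(bdims$wgt))^2)
1 - var(residuals(mod)) / var(bdims$wgt)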

Leverage is defined as the distance of an observation’s \(x\) value from \(\bar{x}\). Cook’s distance combines the residual with the leverage score.

mod %>%
  augment() %>%
  arrange(desc(.hat)) %>%
  head()
##    wgt hgt .fitted .se.fit .resid   .hat .sigma  .cooksd .std.resid
## 1 85.5 198    96.6    1.26 -11.08 0.0182   9.30 0.013373     -1.201
## 2 90.9 197    95.6    1.21  -4.66 0.0170   9.31 0.002208     -0.505
## 3 49.8 147    44.8    1.13   5.02 0.0148   9.31 0.002212      0.543
## 4 80.7 194    91.9    1.07 -11.20 0.0131   9.30 0.009758     -1.211
## 5 95.9 193    91.4    1.05   4.51 0.0126   9.32 0.001523      0.488
## 6 44.8 150    47.1    1.04  -2.32 0.0124   9.32 0.000397     -0.251
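
The same augment() pipeline, sorted by .cooksd instead of .hat, surfaces the most influential observations:

# Most influential observations by Cook's distance
mod %>%
  augment() %>%
  arrange(desc(.cooksd)) %>%
  head()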

Random Variables

Binomial

library(ggplot2)
library(dplyr)
options(scipen = 999, digits = 2) # sig digits

n = 1:20
density = dbinom(x = 3, size = 1:20, 0.3)
data.frame(n, density) %>%
ggplot(aes(x = n, y = density)) +
  geom_col() +
  geom_text(
    aes(label = round(density,2), y = density + 0.01),
    position = position_dodge(0.9),
    size = 3,
    vjust = 0
  ) +
  labs(title = "P(X = 3) in n Bernoulli trials where p = 0.3",
       subtitle = "The distribution of at bats for a .300 hitter to get 3 hits peaks at 10.",
       x = "trial number (n)",
       y = "Density")

Hypothesis Testing

To conduct a hypothesis test, draw the sampling distribution and shade the p-value. Calculate the test statistic and reject \(H_0\) if the p-value is less than significance level \(\alpha\).

A sample of n = 2,500 students finds y = 1,200 binge drink, a proportion of p = 0.48. With 95% confidence, is this greater than the national average of \(\pi = 0.44\)? \(H_a\) is \(\pi > 0.44\), so \(H_0\) is \(\pi \le 0.44\). Reject \(H_0\) if \(z = (p - \pi_0)/SE_0\) is greater than \(z_{.05/2} = 1.96\).

library(ggplot2)

n = 2500
y = 1200
p = y / n
pi = 0.44
se = sqrt(p * (1 - p) / n)
se_0 = sqrt(pi * (1 - pi) / n)
rr = qnorm(p = 0.975, mean = pi, sd  = se)
z = (p - pi) / se_0
z_crit = qnorm(p = 0.975, mean = 0, sd = 1)
dat <- data.frame(prob = rnorm(n = 1000, mean = pi, sd = se_0))
ci_lo <- qnorm(.025, mean = p, sd = se) # same as p - z_crit * SE
ci_hi <- qnorm(.975, mean = p, sd = se) # same as p + z_crit * SE
cat("95% confidence interval is (", ci_lo, ",", ci_hi, ")")
## 95% confidence interval is ( 0.46 , 0.5 )
ggplot(dat, aes(x = prob)) +
  geom_density() +
  geom_vline(aes(xintercept = p), color = "red") + 
  geom_rect(
         aes(xmin = rr, xmax = +Inf, ymin = -Inf, ymax = +Inf),
         inherit.aes = FALSE, fill = "red", alpha = 0.05) +
  labs(title = "Null Normal Sampling Probability Distribution",
       subtitle = "Rejection region and sample statistic shown in red",
       x = "Proportion",
       y = "Density")

One Sample Proportion Comparison

The one sample proportion comparison test compares measured proportion \(p\) to the hypothesized population proportion \(\pi_0\) with null hypothesis \(H_0:\pi=\pi_0\).

Exact Binomial

Use the exact binomial probability comparison test for small samples.

What is the probability of observing \(p \ge .95\) from a sample of \(n = 200\) when \(\pi = .90\)?

sum(dbinom(x = 190:200, size = 200, prob = .90))
## [1] 0.0081
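
The built-in binom.test() should reproduce the same tail probability directly:

# Exact one-sided test: P(X >= 190 | n = 200, pi = 0.90)
binom.test(x = 190, n = 200, p = 0.90, alternative = "greater")
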
library(ggplot2)

n = 200
y = 190
p = y / n
pi = 0.90
rr = qbinom(p = 0.975, size = n, prob = pi)
dat <- data.frame(mydist = rbinom(n = 1000, size = n, prob = pi))
ci_lo <- qbinom(.025, size = n, prob = p)
ci_hi <- qbinom(.975, size = n, prob = p)
cat("95% confidence interval is (", ci_lo, ",", ci_hi, ")")
## 95% confidence interval is ( 184 , 196 )
ggplot(dat, aes(x = mydist)) +
  geom_density() +
  geom_vline(aes(xintercept = y), color = "red") + 
  geom_rect(
         aes(xmin = rr, xmax = +Inf, ymin = -Inf, ymax = +Inf),
         inherit.aes = FALSE, fill = "red", alpha = 0.05) +
  labs(title = "Null Binomial Sampling Probability Distribution",
       subtitle = "Rejection region and sample statistic shown in red",
       x = "Count",
       y = "Density")

Wald Confidence Interval

Use a Wald confidence interval when the binomial distribution is approximately normal.

A maintenance crew resolves y = 33 of n = 50 repair requests within 24 hours, a proportion of p = 0.66. With 95% confidence, what proportion of repair requests does the maintenance crew resolve within 24 hours?

n = 50
y = 33
p = y / n
se = sqrt(p * (1 - p) / n)
ci_lo <- qnorm(.025, mean = p, sd = se) # same as p - z_crit * SE
ci_hi <- qnorm(.975, mean = p, sd = se) # same as p + z_crit * SE
cat("95% confidence interval is (", ci_lo, ",", ci_hi, ")")
## 95% confidence interval is ( 0.53 , 0.79 )

Wilson-Agresti-Coull (WAC)

If the normality condition does not hold, use the Wilson-Agresti-Coull (WAC) confidence interval instead.

A maintenance crew resolves y = 43 of n = 50 repair requests within 24 hours, a proportion of p = 0.86. With 95% confidence, what proportion of repair requests does the maintenance crew resolve within 24 hours?

n = 50
y = 43
z_crit = qnorm(p = 0.975, mean = 0, sd = 1)
p = (y + 0.5 * z_crit^2) / (n + z_crit^2)
se = sqrt(p * (1 - p) / n)
me = z_crit * se
ci_lo <- qnorm(.025, mean = p, sd = se) # same as p - z_crit * SE
ci_hi <- qnorm(.975, mean = p, sd = se) # same as p + z_crit * SE
cat("95% confidence interval is", p, "+/-", me, ", (", ci_lo, ",", ci_hi, ").")
## 95% confidence interval is 0.83 +/- 0.1 , ( 0.73 , 0.94 ).

When p ~ 0 or 1

If \(p \approx 0\) or \(p \approx 1\), the normal approximation breaks down. When every trial is a success (\(p = 1\)), set the lower confidence bound to \((\alpha/2)^{(1/n)}\) and the upper bound to 1; handle \(p = 0\) analogously.

A maintenance crew resolves y = 50 of n = 50 repair requests within 24 hours, a proportion of p = 1.00. With 95% confidence, what proportion of repair requests does the maintenance crew resolve within 24 hours?

n = 50
y = 50
z_crit = qnorm(p = 0.975, mean = 0, sd = 1)
ci_lo = (.05 / 2)^(1/n)
ci_hi = 1
cat("95% confidence interval is (", ci_lo, ",", ci_hi, ").")
## 95% confidence interval is ( 0.93 , 1 ).

Two Sample Proportion Comparison

The two-sample proportion test compares proportions measured in two independent samples, with null hypothesis \(H_0: \pi_1 = \pi_2\).
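
A sketch with made-up counts (base R’s prop.test() handles the two-sample case when given vectors of successes and sample sizes):

# Hypothetical data: 30/100 successes in group 1 vs. 45/120 in group 2
prop.test(x = c(30, 45), n = c(100, 120))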