R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

library(ggplot2)

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
        geom_point()

Quick Fun Tips!!

* To execute a line of code, press Ctrl+Enter.

* Check the version of R.

version
##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          3                           
## minor          5.1                         
## year           2018                        
## month          07                          
## day            02                          
## svn rev        74947                       
## language       R                           
## version.string R version 3.5.1 (2018-07-02)
## nickname       Feather Spray

* List the packages available for installation from the configured repositories.

available.packages()

* Load a package and check that it is attached on the search path.

library(caret)
## Loading required package: lattice
search()
##  [1] ".GlobalEnv"        "package:caret"     "package:lattice"  
##  [4] "package:ggplot2"   "package:stats"     "package:graphics" 
##  [7] "package:grDevices" "package:utils"     "package:datasets" 
## [10] "package:methods"   "Autoloads"         "package:base"

* List all the objects in the current environment.

ls()

* Show a package's documentation index and other details.

library(help = "caret")

* Search the help system across installed packages for a term.

??caret
??grid
??plot

* Get help on a specific function or topic.

?plot
?caret

* Use apropos() to find objects whose names partially match a search term. This is useful when you only faintly remember the spelling of a command.

apropos("createdata")
## [1] "createDataPartition"

2. Working directory.

* Get the working directory

getwd()
## [1] "C:/Users/preygupta/Documents/Predictive Analytics with R"

* Set the working directory

setwd("C:/Users/preygupta/Documents/spls")
setwd("C:/Users/preygupta/Documents/Predictive Analytics with R")

Typeof examples

b = c(1:4)
typeof(b)
## [1] "integer"
d = c(1,2,3,4)
typeof(d)
## [1] "double"
c = as.integer(c(1,5,7))
typeof(c)
## [1] "integer"
a <- c(1L,2L,3L,4L)
typeof(a)
## [1] "integer"

3. DATA STRUCTURES IN R

|    | Homogeneous   | Heterogeneous |
|----|---------------|---------------|
| 1d | Atomic vector | List          |
| 2d | Matrix        | Data frame    |
| nd | Array         |               |

1. Vector

* A basic data structure of R containing the same type of data
* is.vector() does not test if an object is a vector. Instead it returns TRUE only if the object is a vector with no attributes apart from names. Use is.atomic(x) || is.list(x) to test if an object is actually a vector.
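
The behaviour of is.vector() described above can be checked with a small sketch. The attribute name "note" is made up for illustration:

```r
x <- 1:3
is.vector(x)                  # TRUE: no attributes at all

names(x) <- c("a", "b", "c")
is.vector(x)                  # still TRUE: names are allowed

attr(x, "note") <- "hello"    # hypothetical extra attribute
is.vector(x)                  # FALSE, although x is still an atomic vector
is.atomic(x) || is.list(x)    # TRUE: the more reliable test
```
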

v = vector()
str(v)
##  logi(0)
typeof(v)
## [1] "logical"
v = c()
str(v)
##  NULL
typeof(v)
## [1] "NULL"
v = vector(NULL)
## Error in vector(NULL): invalid 'mode' argument
str(v)
##  NULL
typeof(v)
## [1] "NULL"

mode() and storage.mode() each return a character string giving the (storage) mode of an object. The two are often the same, and both rely on the output of typeof(x).

Modes have the same set of names as types (see typeof) except that:

* types "integer" and "double" are returned as "numeric".
* types "special" and "builtin" are returned as "function".
* type "symbol" is called mode "name".
* type "language" is returned as "(" or "call".
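
A short sketch of how the two functions differ in practice. The objects are arbitrary examples, not taken from the original notes:

```r
# Compare typeof() and mode() for a few objects
typeof(1L);  mode(1L)                     # "integer"  vs "numeric"
typeof(1.5); mode(1.5)                    # "double"   vs "numeric"
typeof(sum); mode(sum)                    # "builtin"  vs "function"
typeof(quote(x));     mode(quote(x))      # "symbol"   vs "name"
typeof(quote(x + 1)); mode(quote(x + 1))  # "language" vs "call"
```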

v = vector(NA)
## Error in vector(NA): vector: cannot make a vector of mode 'NA'.
str(v)
##  NULL
typeof(v)
## [1] "NULL"
v = NA
str(v)
##  logi NA
typeof(v)
## [1] "logical"
v = NULL
str(v)
##  NULL
typeof(v)
## [1] "NULL"
typeof(NA_integer_)
## [1] "integer"
str(NA_integer_)
##  int NA

NULL represents the null object in R: it is a reserved word. NULL is often returned by expressions and functions whose values are undefined. NULL is a special object. It is used whenever there is a need to indicate or specify that an object is absent. It should not be confused with a vector or list of zero length.

The NULL object has no type and no modifiable properties. There is only one NULL object in R, to which all instances refer. To test for NULL use is.null. You cannot set attributes on NULL.

NA is a logical constant of length 1 which contains a missing value indicator. NA can be freely coerced to any other vector type except raw. There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ of the other atomic vector types which support missing values: all of these are reserved words in the R language.

NA vs NULL

NA
## [1] NA
class(NA)
## [1] "logical"
NA > 1
## [1] NA
NULL
## NULL
class(NULL)
## [1] "NULL"
NULL > 1
## logical(0)

Vector with NA and NULL

v <-  c( 1, NA, NULL)           # A vector ignores NULL
v 
## [1]  1 NA

List with NA and NULL (including data frames)

  list(1, NA, NULL)  # list does not ignore NULL
## [[1]]
## [1] 1
## 
## [[2]]
## [1] NA
## 
## [[3]]
## NULL
# Source: https://www.r-bloggers.com/r-na-vs-null/
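
To make the distinction concrete, here is a minimal sketch (not from the original notes) contrasting how NA and NULL behave with length(), is.na() and is.null():

```r
length(NA)            # 1: NA is a value that occupies a slot
length(NULL)          # 0: NULL is the absence of an object

is.na(NA_character_)  # TRUE
is.null(NULL)         # TRUE
is.null(list())       # FALSE: an empty list is not NULL

# A missing value takes on the type of the vector it sits in
typeof(c(1L, NA))     # "integer"
typeof(c("a", NA))    # "character"
```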

Numbers

typeof(1)
## [1] "double"
typeof(1L)
## [1] "integer"
typeof(Inf)
## [1] "double"
class(1)
## [1] "numeric"
class(1L)
## [1] "integer"
class(Inf)
## [1] "numeric"
#https://stackoverflow.com/questions/35445112/what-is-the-difference-between-mode-and-class-in-r

Attributes:

All objects can have arbitrary additional attributes, used to store metadata about the object. Attributes can be thought of as a named list (with unique names). Attributes can be accessed individually with attr() or all at once (as a list) with attributes(). By default, most attributes are lost when modifying a vector.

The only attributes not lost are the three most important:

Names, a character vector giving each element a name. Syntax: names(x)

Dimensions, used to turn vectors into matrices and arrays. Syntax: dim(x)

Class, used to implement the S3 object system. Syntax : class(x)
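
As a rough illustration of attributes being dropped, the sketch below attaches a custom attribute (the name my_attribute is arbitrary) and then subsets. It only demonstrates the subsetting case for names:

```r
y <- structure(1:10, my_attribute = "This is a vector")  # hypothetical attribute name
attributes(y[1])      # NULL: the custom attribute is dropped by subsetting
attributes(sum(y))    # NULL: and by summarising

x <- c(a = 1, b = 2, c = 3)
names(x[2:3])         # "b" "c": names survive subsetting
```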

y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
attr(y, "my_attribute")
## [1] "This is a vector"
attributes(y)
## $my_attribute
## [1] "This is a vector"
e = c(3,9,4,6,7)
names(e) = c("q","w","e")


e[1]
## q 
## 3
e["q"]
## q 
## 3
attributes(e)
## $names
## [1] "q" "w" "e" NA  NA
dim(e) #vectors are unidimensional hence the dimension is NULL
## NULL
x <- c(a = 1, b = 2, c = 3)
print(x)
## a b c 
## 1 2 3
x <- setNames(1:3, c("a", "b", "c"))
print(x)
## a b c 
## 1 2 3
# You can create a new vector without names using unname(x), or remove names in place with names(x) <- NULL.
unname(x)
## [1] 1 2 3
names(e) = NULL
print(e) 
## [1] 3 9 4 6 7
attributes(e)
## NULL

NA and NaN

k = NA
is.na(k)
## [1] TRUE
a= 0/0
a
## [1] NaN
s = c("a", "b", a)

is.na(a)
## [1] TRUE
is.na(s)
## [1] FALSE FALSE FALSE
is.nan(a)
## [1] TRUE
which(s == "NaN")
## [1] 3
# k == NA returns NA rather than TRUE, so always test with is.na(k)
# Nested calls to c() are flattened into a single atomic vector:
c(1, c(2, c(3, 4)))
## [1] 1 2 3 4
# Given a vector, you can determine its type with typeof(), or check if it’s a specific type with an “is” function: is.character(), is.double(), is.integer(), is.logical(), or, more generally, is.atomic().

int_var <- c(1L, 6L, 10L)
typeof(int_var)
## [1] "integer"
is.integer(int_var)
## [1] TRUE
is.atomic(int_var)
## [1] TRUE
# is.numeric() is a general test for the “numberliness” of a vector and returns TRUE for both integer and double vectors.

is.numeric(int_var)
## [1] TRUE

2. Factors

One important use of attributes is to define factors. A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class, “factor”, which makes them behave differently from regular integer vectors, and the levels, which defines the set of allowed values.

x <- factor(c("a", "b", "b", "a"))
x
## [1] a b b a
## Levels: a b
class(x)
## [1] "factor"
typeof(x)
## [1] "integer"
levels(x)
## [1] "a" "b"
class(x[1])
## [1] "factor"
typeof(x[1])
## [1] "integer"
x[1]
## [1] a
## Levels: a b
# While factors look (and often behave) like character vectors, they are actually integers. Be careful when treating them like strings. Some string methods (like gsub() and grepl()) will coerce factors to strings, while others (like nchar()) will throw an error, and still others (like c()) will use the underlying integer values. For this reason, it’s usually best to explicitly convert factors to character vectors if you need string-like behaviour.


# You can't use values that are not in the levels
x[2] <- "c"
## Warning in `[<-.factor`(`*tmp*`, 2, value = "c"): invalid factor level, NA
## generated
print(x)
## [1] a    <NA> b    a   
## Levels: a b
# NB: in this version of R, c() on factors returns the underlying integer codes (since R 4.1.0 it returns a combined factor)
c(factor("a"), factor("b"))
## [1] 1 1

Factors are useful when you know the possible values a variable may take, even if you don’t see all values in a given dataset. Using a factor instead of a character vector makes it obvious when some groups contain no observations:

sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))

table(sex_char)
## sex_char
## m 
## 3
table(sex_factor)
## sex_factor
## m f 
## 3 0

Sometimes when a data frame is read directly from a file, a column you’d thought would produce a numeric vector instead produces a factor. This is caused by a non-numeric value in the column, often a missing value encoded in a special way like . or -. To remedy the situation, coerce the vector from a factor to a character vector, and then from a character to a double vector. (Be sure to check for missing values after this process.) Of course, a much better plan is to discover what caused the problem in the first place and fix that; using the na.strings argument to read.csv() is often a good place to start.

z <- read.csv(text = "value\n12\n1\n.\n9\n1\n9")
z
##   value
## 1    12
## 2     1
## 3     .
## 4     9
## 5     1
## 6     9
class(z)
## [1] "data.frame"
typeof(z)
## [1] "list"
class(z$value)
## [1] "factor"
typeof(z$value)
## [1] "integer"
levels(z$value)
## [1] "."  "1"  "12" "9"
as.character(z$value)
## [1] "12" "1"  "."  "9"  "1"  "9"
as.double(z$value)  # absurd value
## [1] 3 2 1 4 2 4
as.double(as.character(z$value))
## Warning: NAs introduced by coercion
## [1] 12  1 NA  9  1  9
z$value = as.double(as.character(z$value))
## Warning: NAs introduced by coercion
print(z)
##   value
## 1    12
## 2     1
## 3    NA
## 4     9
## 5     1
## 6     9
class(z$value)
## [1] "numeric"
z <- read.csv(text = "value\n12\n1\n.\n9\n1\n9", na.strings=".")
typeof(z$value)
## [1] "integer"
class(z$value)
## [1] "integer"
print(z)
##   value
## 1    12
## 2     1
## 3    NA
## 4     9
## 5     1
## 6     9

Unfortunately, in R versions before 4.0.0 most data loading functions automatically convert character vectors to factors (since R 4.0.0 the default is stringsAsFactors = FALSE). This is suboptimal, because there’s no way for those functions to know the set of all possible levels or their optimal order. Instead, use the argument stringsAsFactors = FALSE to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the data.
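
A minimal sketch of that workflow, using made-up inline data in place of a real file. The column name size and its values are assumptions for illustration:

```r
# Hypothetical inline data standing in for a real file
dat <- read.csv(text = "size\nsmall\nlarge\nmedium", stringsAsFactors = FALSE)
class(dat$size)       # "character"

# Convert by hand, stating the full set of levels and their order
dat$size <- factor(dat$size, levels = c("small", "medium", "large"))
levels(dat$size)      # "small" "medium" "large"
table(dat$size)
```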

Manipulating levels in a factor.

f1 <- factor(letters)
f3 <- factor(letters, levels = rev(letters))
f2 <- rev(factor(letters))
# Append new levels to an existing factor
levels(f1) = c(levels(f1), "qq")
levels(f1) = c(levels(f1), 1)
# Add a "No Answer" level to every factor column of a data frame
addNoAnswer <- function(x){
  if(is.factor(x)) return(factor(x, levels=c(levels(x), "No Answer")))
  return(x)
}

df <- data.frame(x = 1:3, y = factor(c("a", "b", "c")))
df <- as.data.frame(lapply(df, addNoAnswer))
levels(df$y)
## [1] "a"         "b"         "c"         "No Answer"

3. LISTS

A list is an object that can contain elements of different types, such as strings, numbers, vectors, or even another list. A list can also contain a matrix or a function as an element. Lists are created with the list() function. In other words, a list is a generic vector containing other objects.

Lists are sometimes called recursive vectors, because a list can contain other lists. This makes them fundamentally different from atomic vectors. Atomic vectors are flat.

x <- list(list(list(list())))
str(x)
## List of 1
##  $ :List of 1
##   ..$ :List of 1
##   .. ..$ : list()
is.recursive(x)
## [1] TRUE
x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))
str(x)
## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ : num [1:2] 3 4
str(y)
## List of 4
##  $ : num 1
##  $ : num 2
##  $ : num 3
##  $ : num 4
class(y)
## [1] "list"

c() will combine several lists into one. If given a combination of atomic vectors and lists, c() will coerce the vectors to lists before combining them.

The typeof() a list is list. You can test for a list with is.list() and coerce to a list with as.list(). You can turn a list into an atomic vector with unlist(). If the elements of a list have different types, unlist() uses the same coercion rules as c().

Lists are used to build up many of the more complicated data structures in R. For example, both data frames (described in data frames) and linear models objects (as produced by lm()) are lists:
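
A quick check of that claim, as a sketch using the built-in mtcars data set (the model formula is only an illustration):

```r
is.list(mtcars)                       # TRUE: a data frame is a list of columns

mod <- lm(mpg ~ wt, data = mtcars)    # mtcars ships with base R
is.list(mod)                          # TRUE: a fitted model is a list as well
names(mod)[1:3]                       # components can be pulled out like list elements
```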

x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
typeof(x)
## [1] "list"
colnames(x)
## NULL
y = unlist(x)
y
## [1] "1"     "2"     "3"     "a"     "TRUE"  "FALSE" "TRUE"  "2.3"   "5.9"
typeof(y)
## [1] "character"
y = as.vector(x)
y
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1]  TRUE FALSE  TRUE
## 
## [[4]]
## [1] 2.3 5.9
typeof(y)
## [1] "list"

Accessing list.

n = c(2, 3, 5)
s =  c("aa",  "bb",  "cc")
b = c(TRUE,  FALSE,  TRUE,  FALSE,  FALSE )
x = list( n, s, b, 3)    # x contains copies of n, s, b


#Naming List
list_data <- list(c("Feb","Mar","Apr"), matrix(c(3,9,5,1,-2,8), nrow = 2),   list("green",12.3))
print(list_data)
## [[1]]
## [1] "Feb" "Mar" "Apr"
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    3    5   -2
## [2,]    9    1    8
## 
## [[3]]
## [[3]][[1]]
## [1] "green"
## 
## [[3]][[2]]
## [1] 12.3
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
print(list_data)
## $`1st Quarter`
## [1] "Feb" "Mar" "Apr"
## 
## $A_Matrix
##      [,1] [,2] [,3]
## [1,]    3    5   -2
## [2,]    9    1    8
## 
## $`A Inner list`
## $`A Inner list`[[1]]
## [1] "green"
## 
## $`A Inner list`[[2]]
## [1] 12.3
list_data$A_Matrix
##      [,1] [,2] [,3]
## [1,]    3    5   -2
## [2,]    9    1    8
list_data$"A_Matrix"
##      [,1] [,2] [,3]
## [1,]    3    5   -2
## [2,]    9    1    8
list_data$'A_Matrix'
##      [,1] [,2] [,3]
## [1,]    3    5   -2
## [2,]    9    1    8
list_data[[2]]
##      [,1] [,2] [,3]
## [1,]    3    5   -2
## [2,]    9    1    8

4. MATRICES AND ARRAYS

Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array. A special case of the array is the matrix, which has two dimensions. Matrices are used commonly as part of the mathematical machinery of statistics. Values are filled column by column (column-major order).

m = c(1:24)
print(m)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24
dim(m) = c(5,5)
## Error in dim(m) = c(5, 5): dims [product 25] do not match the length of object [24]
dim(m) = c(6,4)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    7   13   19
## [2,]    2    8   14   20
## [3,]    3    9   15   21
## [4,]    4   10   16   22
## [5,]    5   11   17   23
## [6,]    6   12   18   24
class(m)
## [1] "matrix"
typeof(m)
## [1] "integer"
dim(m) = c(2,3,4)
m
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]   13   15   17
## [2,]   14   16   18
## 
## , , 4
## 
##      [,1] [,2] [,3]
## [1,]   19   21   23
## [2,]   20   22   24
class(m)
## [1] "array"
typeof(m)
## [1] "integer"
# Two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)

print(a)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
ncol(a)
## [1] 3
nrow(a)
## [1] 2
dim(a)
## [1] 2 3
# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))

print(b)
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
ncol(b)
## [1] 3
nrow(b)
## [1] 2
dim(b)
## [1] 2 3 2
rownames(a) <- c("A", "B")
colnames(a) <- c("a", "b", "c")
a
##   a b c
## A 1 3 5
## B 2 4 6
dimnames(b) <- list(c("one", "two"), c("a", "b", "c"), c("A", "B"))
b
## , , A
## 
##     a b c
## one 1 3 5
## two 2 4 6
## 
## , , B
## 
##     a  b  c
## one 7  9 11
## two 8 10 12

Vectors are not the only 1-dimensional data structure. You can have matrices with a single row or single column, or arrays with a single dimension. They may print similarly, but will behave differently. The differences aren’t too important, but it’s useful to know they exist in case you get strange output from a function (tapply() is a frequent offender). As always, use str() to reveal the differences.

q = matrix(1:3, ncol = 1)
print(q)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
q = array(1:3, c(3))
print(q)
## [1] 1 2 3
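
The sketch below applies str() and dim() to the 1-dimensional variants mentioned above so the differences become visible:

```r
str(1:3)                      # a plain vector has no dim attribute
str(matrix(1:3, ncol = 1))    # column vector: dim is c(3, 1)
str(matrix(1:3, nrow = 1))    # row vector: dim is c(1, 3)
str(array(1:3, 3))            # 1d array: dim is 3

dim(1:3)                      # NULL
dim(array(1:3, 3))            # 3
```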

5. Data Frame

(Most commonly used) A data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list.

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
df
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

Beware data.frame()’s default behaviour (in R versions before 4.0.0), which turns strings into factors. Use stringsAsFactors = FALSE to suppress this behaviour:

df <- data.frame(
x = 1:3,
y = c("a", "b", "c"),
stringsAsFactors = FALSE)
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
typeof(df)
## [1] "list"
class(df)
## [1] "data.frame"
is.data.frame(df)
## [1] TRUE

Coercing other data structures into a data frame.

Use as.data.frame():

v = c(1:5)
as.data.frame(v)
##   v
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
l = list(a = 1:3,b= c("a","b"))
print(l)
## $a
## [1] 1 2 3
## 
## $b
## [1] "a" "b"
as.data.frame(l)
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 3, 2
l = list(a = 1:2,b= c("a","b"))
as.data.frame(l)
##   a b
## 1 1 a
## 2 2 b
print(l)
## $a
## [1] 1 2
## 
## $b
## [1] "a" "b"

Combining data frames

You can combine data frames using cbind() and rbind():

cbind(df, data.frame(z = 3:1))
##   x y z
## 1 1 a 3
## 2 2 b 2
## 3 3 c 1
cbind(df, data.frame(z = 3:2))  # cbinding with unequal lengths
## Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2
library(rowr)
cbind.fill(df, data.frame(z = 3:2), fill = NA)
##   x y  z
## 1 1 a  3
## 2 2 b  2
## 3 3 c NA
dplyr::full_join(df, data.frame(x = 4:5))
## Joining, by = "x"
##   x    y
## 1 1    a
## 2 2    b
## 3 3    c
## 4 4 <NA>
## 5 5 <NA>
rbind(df, data.frame(10,"z"))
## Error in match.names(clabs, names(xi)): names do not match previous names
rbind(df, data.frame(x = 10, y = "z"))
##    x y
## 1  1 a
## 2  2 b
## 3  3 c
## 4 10 z
plyr::rbind.fill(df, data.frame(10))              # rbinding with unequal lengths
##    x    y X10
## 1  1    a  NA
## 2  2    b  NA
## 3  3    c  NA
## 4 NA <NA>  10
plyr::rbind.fill(df, data.frame(x = 20))
##    x    y
## 1  1    a
## 2  2    b
## 3  3    c
## 4 20 <NA>
# When combining column-wise, the number of rows must match, but row names are ignored. When combining row-wise, both the number and names of columns must match. Use plyr::rbind.fill() to combine
# data frames that don’t have the same columns

It’s a common mistake to try and create a data frame by cbind()ing vectors together. Instead use data.frame() directly

good <- data.frame(a = 1:2, b = c("a", "b"),
stringsAsFactors = FALSE)
str(good)
## 'data.frame':    2 obs. of  2 variables:
##  $ a: int  1 2
##  $ b: chr  "a" "b"
good
##   a b
## 1 1 a
## 2 2 b
bad <- data.frame(cbind(a = 1:2, b = c("a", "b")))
str(bad)
## 'data.frame':    2 obs. of  2 variables:
##  $ a: Factor w/ 2 levels "1","2": 1 2
##  $ b: Factor w/ 2 levels "a","b": 1 2

The conversion rules for cbind() are complicated and best avoided by ensuring all inputs are of the same type.
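
To see why mixed inputs cause trouble, this small sketch inspects what cbind() returns before data.frame() ever sees it:

```r
m <- cbind(a = 1:2, b = c("a", "b"))
is.matrix(m)   # TRUE: cbind() on vectors builds a matrix, not a data frame
typeof(m)      # "character": the integers were coerced to strings
m[, "a"]       # the numbers now come back as strings
```

Because a matrix can hold only one type, the integer column is converted to character, and data.frame() then treats both columns as strings.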

Special Columns

Since a data frame is a list of vectors, it is possible for a data frame to have a column that is a list:

df <- data.frame(x = 1:3)
df$y = c(4:6)
df[,3]=data.frame(z = c(7:9))
df$y <- list(1:2, 1:3, 1:4)
df
##   x          y z
## 1 1       1, 2 7
## 2 2    1, 2, 3 8
## 3 3 1, 2, 3, 4 9
str(df)
## 'data.frame':    3 obs. of  3 variables:
##  $ x: int  1 2 3
##  $ y:List of 3
##   ..$ : int  1 2
##   ..$ : int  1 2 3
##   ..$ : int  1 2 3 4
##  $ z: int  7 8 9
dfn = data.frame(x = 1:3, y = list(1:2, 1:3, 1:4))
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 2, 3, 4
dfn = data.frame(x = 1:3, y = list(1:3))
str(dfn)
## 'data.frame':    3 obs. of  2 variables:
##  $ x   : int  1 2 3
##  $ X1.3: int  1 2 3

A workaround is to use I(), which causes data.frame() to treat the list as one unit:

dfl <- data.frame(x = 1:3, y = I(list(1:2, 1:3, 1:4)))
str(dfl)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y:List of 3
##   ..$ : int  1 2
##   ..$ : int  1 2 3
##   ..$ : int  1 2 3 4
##   ..- attr(*, "class")= chr "AsIs"
dfl[3, "y"]
## [[1]]
## [1] 1 2 3 4
dfm <- data.frame(x = 1:3, y = I(matrix(1:9, nrow = 3)))
str(dfm)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: 'AsIs' int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
dfm
##   x y.1 y.2 y.3
## 1 1   1   4   7
## 2 2   2   5   8
## 3 3   3   6   9
dfm[2, "y"]
##      [,1] [,2] [,3]
## [1,]    2    5    8

Use list and array columns with caution: many functions that work with data frames assume that all columns are atomic vectors.

4. Reading Data

R will automatically:

  1. skip lines that begin with a #
  2. figure out how many rows there are (and how much memory needs to be allocated)
  3. figure out what type of variable is in each column of the table

Telling R all these things directly makes R run faster and more efficiently. read.csv is identical to read.table except that some defaults differ, most notably the separator is a comma and header = TRUE.
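
As a rough sketch of what telling R these things directly can look like, the call below passes colClasses, nrows, and comment.char explicitly. The inline data and column names are invented for illustration:

```r
csv_text <- "id,temp,grp\n1,-12.1,A\n2,-12.9,B\n3,-11.4,A"   # hypothetical data

dat <- read.table(
  text = csv_text, header = TRUE, sep = ",",
  colClasses = c("integer", "numeric", "character"),  # skip type guessing
  nrows = 3,                                          # upper bound on rows to read
  comment.char = ""                                   # no comment scanning
)
str(dat)
```

On a three-row example the speed-up is not measurable. The benefit shows up on large files, where type guessing and repeated memory allocation dominate.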

Note:

  • read.xlsx("filename.xlsx", 1) (from the xlsx package) reads your file and infers reasonably useful column classes for the data frame, but is very slow for large data sets.
  • read.xlsx2("filename.xlsx", 1) is faster, but you will have to define column classes manually.

inputData = read.table(file = "C:/Users/preygupta/Documents/Predictive Analytics with R/train.csv", header = TRUE, sep = "," )
head(inputData)
##   X.raw.error             X         X.1   X.2      X.3       X.4  X.5
## 1          ID      datetime temperature  var1 pressure windspeed var2
## 2           0 7/1/2013 0:00           0     0        0    571.91    A
## 3           1 7/1/2013 1:00       -12.1 -19.3      996    575.04    A
## 4           2 7/1/2013 2:00       -12.9   -20     1000   578.435    A
## 5           3 7/1/2013 3:00       -11.4 -17.1      995    582.58    A
## 6           4 7/1/2013 4:00       -11.4 -19.3     1005     586.6    A
##                       X.6
## 1 electricity_consumption
## 2                     216
## 3                     210
## 4                     225
## 5                     216
## 6                     222
str(inputData$var2)
##  NULL
levels(inputData$var2)
## NULL
inputData = read.table(file = "C:/Users/preygupta/Documents/Predictive Analytics with R/train.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE )

levels(inputData$var2)
## NULL
str(inputData$var2)
##  NULL
system.time(read.table(file = "C:/Users/preygupta/Documents/Predictive Analytics with R/train.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE ))
##    user  system elapsed 
##    0.07    0.00    0.07
system.time(read.csv2(file = "C:/Users/preygupta/Documents/Predictive Analytics with R/train.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE ))
##    user  system elapsed 
##    0.03    0.01    0.05
inputData = read.table(file = "C:/Users/preygupta/Documents/Predictive Analytics with R/train.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE, comment.char = "^", nrows = 10 )

Dump

ch = c("a","d")
dump(list = c("ch","t", "inputData"), file = "test1.R", append = FALSE, envir = parent.frame(), evaluate = TRUE)

# Remove ch from the environment (for example with rm(ch)), then source the file to restore it
source("test1.R")

Dput

y <- data.frame(a = 1, b = "a")
dput(y)
## structure(list(a = 1, b = structure(1L, .Label = "a", class = "factor")), class = "data.frame", row.names = c(NA, 
## -1L))
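
dput() pairs naturally with dget(), which reads the deparsed object back in. The following is a minimal round-trip sketch; the temporary file path is generated by tempfile() and is not part of the original notes:

```r
y <- data.frame(a = 1, b = "a")

tmp <- tempfile(fileext = ".R")   # throwaway file for the example
dput(y, file = tmp)               # write the R code that rebuilds y
y2 <- dget(tmp)                   # evaluate that code to get the object back

all.equal(y, y2)                  # check that the round trip preserved the data
unlink(tmp)                       # clean up
```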

Interfaces to the Outside World

Data are read in using connection interfaces. Connections can be made to files (most common) or to other more exotic things.

  • file, opens a connection to a file
  • gzfile, opens a connection to a file compressed with gzip
  • bzfile, opens a connection to a file compressed with bzip2
  • url, opens a connection to a webpage

con <- file("C:/Users/preygupta/Documents/Predictive Analytics with R/train.csv", "r") 
data <- read.csv(con) 
head(data)
##   X.raw.error             X         X.1   X.2      X.3       X.4  X.5
## 1          ID      datetime temperature  var1 pressure windspeed var2
## 2           0 7/1/2013 0:00           0     0        0    571.91    A
## 3           1 7/1/2013 1:00       -12.1 -19.3      996    575.04    A
## 4           2 7/1/2013 2:00       -12.9   -20     1000   578.435    A
## 5           3 7/1/2013 3:00       -11.4 -17.1      995    582.58    A
## 6           4 7/1/2013 4:00       -11.4 -19.3     1005     586.6    A
##                       X.6
## 1 electricity_consumption
## 2                     216
## 3                     210
## 4                     225
## 5                     216
## 6                     222
str(data)
## 'data.frame':    26497 obs. of  8 variables:
##  $ X.raw.error: Factor w/ 26497 levels "0","1","10","100",..: 26497 1 2 8425 16656 21439 22326 23109 23928 24675 ...
##  $ X          : Factor w/ 26497 levels "1/1/2014 0:00",..: 26497 19873 19874 19885 19890 19891 19892 19893 19894 19895 ...
##  $ X.1        : Factor w/ 61 levels "-0.7","-1.4",..: 61 25 6 7 5 5 4 8 5 4 ...
##  $ X.2        : Factor w/ 72 levels "-0.7","-1.4",..: 72 45 16 19 13 16 16 13 14 15 ...
##  $ X.3        : Factor w/ 75 levels "0","1000","1001",..: 75 1 71 2 70 7 15 8 72 14 ...
##  $ X.4        : Factor w/ 5604 levels "1.075","1.2",..: 5604 4357 4358 4359 4406 4407 1482 3337 4880 393 ...
##  $ X.5        : Factor w/ 4 levels "A","B","C","var2": 4 1 1 1 1 1 1 1 1 1 ...
##  $ X.6        : Factor w/ 253 levels "1002","1059",..: 253 30 28 33 30 32 30 31 32 31 ...
close(con)
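
The same pattern works for compressed files. This sketch writes a tiny gzip-compressed CSV to a temporary location and reads it back through gzfile(); the file contents are invented for illustration:

```r
tmp <- tempfile(fileext = ".csv.gz")   # throwaway file for the example

con <- gzfile(tmp, "w")                # open a gzip connection for writing
writeLines(c("x,y", "1,a", "2,b"), con)
close(con)

con <- gzfile(tmp, "r")                # reopen the same file for reading
dat <- read.csv(con)
close(con)                             # always close what you open

dat
unlink(tmp)
```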

readLines

readLines can be useful for reading in lines of webpages

# This might take time
con <- url("http://www.jhsph.edu", "r")
x <- readLines(con)
close(con)
head(x)
## [1] "<!DOCTYPE html>"                                               
## [2] "<html lang=\"en\">"                                            
## [3] ""                                                              
## [4] "<head>"                                                        
## [5] "<meta charset=\"utf-8\" />"                                    
## [6] "<title>Johns Hopkins Bloomberg School of Public Health</title>"

5. Subsetting

R’s subsetting operators are powerful and fast. Mastery of subsetting allows you to succinctly express complex operations in a way that few other languages can match.

There are a number of operators that can be used to extract subsets of R objects.

  • [ always returns an object of the same class as the original; it can be used to select more than one element (the one exception is matrix-style subsetting, which can simplify the result to a vector unless drop = FALSE is given)
  • [[ is used to extract elements of a list or a data frame; it can only be used to extract a single element and the class of the returned object will not necessarily be a list or data frame
  • $ is used to extract elements of a list or data frame by name; semantics are similar to that of [[.

Atomic Vectors

  x <- c(2.1, 4.2, 3.3, 5.4)
# subsetting with +ve integer.

x[c(3, 1)]
## [1] 3.3 2.1
x[order(x)]
## [1] 2.1 3.3 4.2 5.4
x[c(2.1, 2.9)]
## [1] 4.2 4.2
# Negative integers omit elements at the specified positions:
x[-c(3, 1)]
## [1] 4.2 5.4
### You can’t mix positive and negative integers in a single subset.
x[c(-1, 2)]
## Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts
# Logical vectors select elements where the corresponding logical value is TRUE.
x[c(TRUE, TRUE, FALSE, FALSE)]
## [1] 2.1 4.2
x[x > 3]
## [1] 4.2 3.3 5.4
### If the logical vector is shorter than the vector being subsetted, it will be recycled to be the same length.
x[c(TRUE, FALSE)]
## [1] 2.1 3.3
x[c(TRUE, FALSE, TRUE, FALSE)]
## [1] 2.1 3.3
### A missing value in the index always yields a missing value in the output
x[c(TRUE, TRUE, NA, FALSE)]
## [1] 2.1 4.2  NA
# Nothing returns the original vector. This is not useful for vectors but is very useful for matrices, data frames, and arrays.
x[]
## [1] 2.1 4.2 3.3 5.4
# Character vectors
y = setNames(x, letters[1:4])
y[c("d", "c", "a")]
##   d   c   a 
## 5.4 3.3 2.1
############ Quick notes. Useful tips.
letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
month.abb
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
month.name
##  [1] "January"   "February"  "March"     "April"     "May"      
##  [6] "June"      "July"      "August"    "September" "October"  
## [11] "November"  "December"
############

# When subsetting with [ names are always matched exactly
z <- c(abc = 1, def = 2)
z[c("a", "d")]
## <NA> <NA> 
##   NA   NA

Subsetting a list works in the same way as subsetting an atomic vector. Using [ will always return a list; [[ and $ are discussed in detail below.

Matrices and arrays

a <- matrix(10:18, nrow = 3)
colnames(a) = c("A", "B", "C")
print(a)
##       A  B  C
## [1,] 10 13 16
## [2,] 11 14 17
## [3,] 12 15 18
a[1:2, ]
##       A  B  C
## [1,] 10 13 16
## [2,] 11 14 17
class(a[1:2, ])    # first is row and then is column so it is basically [row,column]
## [1] "matrix"
a[c(4,7)]         # a single index vector counts elements in column-major order (down the columns)
## [1] 13 16
a[c(T, F, T), c("B", "A")]
##       B  A
## [1,] 13 10
## [2,] 15 12
a[0,]
##      A B C
class(a[0,-2])
## [1] "matrix"
class(a[0,])    # By default, [ will simplify the results to the lowest possible dimensionality
## [1] "matrix"
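
The simplification mentioned in the last comment is easiest to see when a single row is selected. A short sketch, using the same matrix values as above, shows how drop = FALSE keeps the matrix structure:

```r
a <- matrix(10:18, nrow = 3)

a[1, ]                      # simplified to a plain vector
dim(a[1, ])                 # NULL

a[1, , drop = FALSE]        # stays a 1 x 3 matrix
dim(a[1, , drop = FALSE])   # 1 3
```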

Because matrices and arrays are implemented as vectors with special attributes, you can subset them with a single vector. In that case, they will behave like a vector. Arrays and Matrices in R are stored in column-major order:

(vals <- outer(1:5, 1:5, FUN = "paste", sep = ","))
##      [,1]  [,2]  [,3]  [,4]  [,5] 
## [1,] "1,1" "1,2" "1,3" "1,4" "1,5"
## [2,] "2,1" "2,2" "2,3" "2,4" "2,5"
## [3,] "3,1" "3,2" "3,3" "3,4" "3,5"
## [4,] "4,1" "4,2" "4,3" "4,4" "4,5"
## [5,] "5,1" "5,2" "5,3" "5,4" "5,5"
class(vals)
## [1] "matrix"
vals[c(4,15)]     # accessing a 2-d matrix using single index
## [1] "4,1" "5,3"
vals[15]   
## [1] "5,3"
select <- matrix(ncol = 2, byrow = TRUE, c(         # accessing a matrix with a matrix of indices; the same works for arrays
1, 1,
3, 1,
2, 4
))
vals[select]
## [1] "1,1" "3,1" "2,4"
vals[[2,2]]
## [1] "2,2"
class(vals[[2,2]])
## [1] "character"
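
Since a matrix is stored column by column, a single index and a (row, column) pair are two views of the same position. The sketch below is an added illustration, not part of the original notes: the names m, lin, rc and pos are made up for the example, and it relies on base R's arrayInd() to turn a single index into its row and column.

```r
m <- matrix(10:18, nrow = 3)

lin <- 8                        # a single (linear) index
rc  <- arrayInd(lin, dim(m))    # one-row matrix holding (row, column)

# Going back: in column-major order, position = row + (column - 1) * nrow
pos <- rc[1, 1] + (rc[1, 2] - 1) * nrow(m)

m[lin]    # element reached through the single index
m[rc]     # same element reached through the (row, column) matrix
```

Both lookups should land on the same element, which is why a[c(4, 7)] earlier picked values from the second and third columns.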

Data frames

Data frames possess the characteristics of both lists and matrices: if you subset with a single vector, they behave like lists; if you subset with two vectors, they behave like matrices.

df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
str(df)
## 'data.frame':    3 obs. of  3 variables:
##  $ x: int  1 2 3
##  $ y: int  3 2 1
##  $ z: Factor w/ 3 levels "a","b","c": 1 2 3
df$x == 2
## [1] FALSE  TRUE FALSE
df[df$x == 2, ]
##   x y z
## 2 2 2 b
df[c(1, 3), ]
##   x y z
## 1 1 3 a
## 3 3 1 c
# There are two ways to select columns from a data frame

# Like a list (same as a vector):
df[c("x", "z")]
##   x z
## 1 1 a
## 2 2 b
## 3 3 c
str(df[c("x", "z")])
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ z: Factor w/ 3 levels "a","b","c": 1 2 3
# Like a matrix
df[, c("x", "z")]
##   x z
## 1 1 a
## 2 2 b
## 3 3 c
str(df[, c("x", "z")])
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ z: Factor w/ 3 levels "a","b","c": 1 2 3
df["x"]
##   x
## 1 1
## 2 2
## 3 3
str(df["x"])
## 'data.frame':    3 obs. of  1 variable:
##  $ x: int  1 2 3
df[,"x"]
## [1] 1 2 3
str(df[,"x"])
##  int [1:3] 1 2 3
# There's an important difference if you select a single
# column: matrix subsetting simplifies by default, list
# subsetting does not.

Subsetting operators $ and [[

These operators are useful for lists and data frames.

There are two other subsetting operators: [[ and $. [[ is similar to [, except it can only return a single value and it allows you to pull pieces out of a list. $ is a useful shorthand for [[ combined with character subsetting.

You need [[ when working with lists. This is because when [ is applied to a list it always returns a list: it never gives you the contents of the list. To get the contents, you need [[.

Because it can return only a single value, you must use [[ with either a single positive integer or a string.

x <- list(foo = 1:4, bar = 0:6)
x[1]
## $foo
## [1] 1 2 3 4
class(x[1])
## [1] "list"
x[[1]]
## [1] 1 2 3 4
class(x[[1]])
## [1] "integer"
x$foo
## [1] 1 2 3 4
class(x$foo)
## [1] "integer"
x$"foo"
## [1] 1 2 3 4
class(x$"foo")
## [1] "integer"
x$'foo'
## [1] 1 2 3 4
class(x$'foo')
## [1] "integer"
x[[2]][c(1:3)]
## [1] 0 1 2
x$foo[3]
## [1] 3
x$bar[c(3:5)]
## [1] 2 3 4
x["bar"]
## $bar
## [1] 0 1 2 3 4 5 6
class(x["bar"])
## [1] "list"
x[["bar"]]
## [1] 0 1 2 3 4 5 6
class(x[["bar"]])
## [1] "integer"
x <- list(foo = 1:4, bar = 0:6, baz = "hello")
x[c(1, 3)]
## $foo
## [1] 1 2 3 4
## 
## $baz
## [1] "hello"
str(x)
## List of 3
##  $ foo: int [1:4] 1 2 3 4
##  $ bar: int [1:7] 0 1 2 3 4 5 6
##  $ baz: chr "hello"
x[c("foo","baz")]
## $foo
## [1] 1 2 3 4
## 
## $baz
## [1] "hello"
x <- list(a = list(10,11,12), b = c(3.14, 2.81))
x[c(1,3)]
## $a
## $a[[1]]
## [1] 10
## 
## $a[[2]]
## [1] 11
## 
## $a[[3]]
## [1] 12
## 
## 
## $<NA>
## NULL
class(x[c(1,3)])
## [1] "list"
x[[c(1,3)]]
## [1] 12
class(x[[c(1,3)]])
## [1] "numeric"
str(x)
## List of 2
##  $ a:List of 3
##   ..$ : num 10
##   ..$ : num 11
##   ..$ : num 12
##  $ b: num [1:2] 3.14 2.81
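
The last two calls above look alike but do different things. The following sketch is an added illustration (the variable sel is hypothetical): [ with c(1, 3) asks for elements 1 and 3 of the list, and since there is no third element the result is padded with a NULL; [[ with c(1, 3) indexes recursively, taking element 1 and then element 3 inside it.

```r
x <- list(a = list(10, 11, 12), b = c(3.14, 2.81))

sel <- x[c(1, 3)]        # still a list of length 2; the second slot is NULL
length(sel)
is.null(sel[[2]])

# Recursive indexing: x[[c(1, 3)]] walks into x[[1]] and then takes [[3]]
identical(x[[c(1, 3)]], x[[1]][[3]])
```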

More examples

 x <- list(foo = 1:4, bar = 0.6, baz = "hello") 
 name <- "foo" 
 x[[name]]  ## computed index for ‘foo’
## [1] 1 2 3 4
 x$name     ## element ‘name’ doesn’t exist! NULL 
## NULL
 x$foo
## [1] 1 2 3 4
 x <- c("a", "b", "c", "c", "d", "a")

x[1]
## [1] "a"
class(x[1])
## [1] "character"
x[1:4]
## [1] "a" "b" "c" "c"
x[x>"a"]
## [1] "b" "c" "c" "d"
u <- x > "a" 
x[u]
## [1] "b" "c" "c" "d"
# Because data frames are lists of columns, you can use [[ to extract a column from a data frame:
 mtcars[[1]]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
 mtcars[["cyl"]]
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

Simplifying vs. preserving subsetting

It’s important to understand the distinction between simplifying and preserving subsetting. Simplifying subsets returns the simplest possible data structure that can represent the output, and is useful interactively because it usually gives you what you want. Preserving subsetting keeps the structure of the output the same as the input, and is generally better for programming because the result will always be the same type. Omitting drop = FALSE when subsetting matrices and data frames is one of the most common sources of programming errors.

              Simplifying                 Preserving
Vector        x[[1]]                      x[1]
List          x[[1]]                      x[1]
Factor        x[1:4, drop = T]            x[1:4]
Array         x[1, ] or x[, 1]            x[1, , drop = F] or x[, 1, drop = F]
Data frame    x[, 1] or x[[1]]            x[, 1, drop = F] or x[1]

Preserving is the same for all data types: you get the same type of output as input.

Simplifying behaviour varies slightly between different data types, as described below:

Atomic vectors: removes names

x <- c(a = 1, b = 2)
x[1]
## a 
## 1
x[[1]]
## [1] 1

List: returns the object inside the list, not a single-element list.

y <- list(a = 1, b = 2)
str(y[1])
## List of 1
##  $ a: num 1
str(y[[1]])
##  num 1

Factor: drops any unused levels.

z <- factor(c("a", "b", "a", "b", "a", "b", "a", "b"))
z[1]
## [1] a
## Levels: a b
z[1, drop = TRUE]
## [1] a
## Levels: a
z[c(1,2), drop = TRUE]
## [1] a b
## Levels: a b

Matrix or array: if any of the dimensions has length 1, drops that dimension.

a <- matrix(1:4, nrow = 2)
a[1, , drop = FALSE]
##      [,1] [,2]
## [1,]    1    3
class(a[1, , drop = FALSE])
## [1] "matrix"
a[1,]
## [1] 1 3
class(a[1,])
## [1] "integer"

Data frame: if output is a single column, returns a vector instead of a data frame.

df <- data.frame(a = 1:2, b = 1:2)

str(df[1])
## 'data.frame':    2 obs. of  1 variable:
##  $ a: int  1 2
class(df[1])
## [1] "data.frame"
typeof(df[1])
## [1] "list"
str(df[[1]])
##  int [1:2] 1 2
class(df[[1]])
## [1] "integer"
str(df[, "a", drop = FALSE])
## 'data.frame':    2 obs. of  1 variable:
##  $ a: int  1 2
str(df[, "a"])
##  int [1:2] 1 2
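
To see why omitting drop = FALSE causes trouble in programs, here is a minimal sketch. The data frame d and the two helper functions are hypothetical and exist only for this illustration: code that works with two columns quietly changes behaviour when the selection shrinks to one.

```r
d <- data.frame(a = 1:2, b = 3:4)

# Hypothetical helpers: count the columns that were selected
n_selected      <- function(data, cols) ncol(data[, cols])
n_selected_safe <- function(data, cols) ncol(data[, cols, drop = FALSE])

n_selected(d, c("a", "b"))    # two columns: still a data frame
n_selected(d, "a")            # one column: simplified to a vector, so ncol() is NULL
n_selected_safe(d, "a")       # one column kept as a data frame
```

The preserving form returns a data frame whatever the number of columns, so the code after it does not need a special case.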

$

$ is a shorthand operator, where x$y is equivalent to x[["y", exact = FALSE]].

One common mistake with $ is to try to use it when you have the name of a column stored in a variable:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
var <- "cyl"

mtcars$var
## NULL
  mtcars[[var]]
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
  mtcars$cyl
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

There’s one important difference between $ and [[: $ does partial matching, while [[ matches names exactly by default.

x <- list(abc = 1, b=2)
x$a
## [1] 1
x[["a"]]
## NULL
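
As a small added check of the equivalence stated earlier, the sketch below puts the shorthand and its long form side by side. It reuses the same list and only relies on the exact argument of [[.

```r
x <- list(abc = 1, b = 2)

x$a                        # partial match on "abc"
x[["a", exact = FALSE]]    # the long form of x$a
x[["a"]]                   # exact matching (the default): no element named "a"
x[["abc"]]                 # the full name works as usual
```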

Missing/out of bounds indices

The following table summarises the results of subsetting atomic vectors and lists with [ and [[ and different types of OOB value.

Operator   Index      Atomic   List
[          OOB        NA       list(NULL)
[          NA_real_   NA       list(NULL)
[          NULL       x[0]     list(NULL)
[[         OOB        Error    Error
[[         NA_real_   Error    NULL
[[         NULL       Error    Error

If the input vector is named, then the names of OOB, missing, or NULL components will be “<NA>”.
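
The out-of-bounds (OOB) rows can be tried directly. This is a minimal added sketch: v, l and the helper try_index are hypothetical names, and tryCatch() is used only so that the errors from [[ can be shown without stopping the script.

```r
v <- c(a = 1, b = 2)          # atomic vector
l <- list(a = 1, b = 2)       # list

v[5]                          # [ on an atomic vector: NA
l[5]                          # [ on a list: a list holding NULL

# [[ raises an error for an OOB index, so capture it instead of stopping
try_index <- function(expr) tryCatch(expr, error = function(e) "Error")
try_index(v[[5]])
try_index(l[[5]])
```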

Subsetting and assignment

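x <- 1:5
x[-1] <- 4:1    # replace every element except the first; x is now 1 4 3 2 1

# Subsetting with nothing, [], keeps the original class and structure:
mtcars[] <- lapply(mtcars, as.integer)   # mtcars stays a data frame
mtcars <- lapply(mtcars, as.integer)     # mtcars becomes a plain list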
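
The difference between the two mtcars assignments is easier to see on a small data frame. The sketch below is an added illustration with hypothetical names (df1, df2); it leaves mtcars alone.

```r
df1 <- data.frame(a = c(1.5, 2.5), b = c(3.9, 4.1))
df2 <- df1

df1[] <- lapply(df1, as.integer)   # fills the existing data frame
df2   <- lapply(df2, as.integer)   # replaces it with the list that lapply returns

class(df1)
class(df2)
```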

With lists, you can use subsetting + assignment + NULL to remove components from a list. To add a literal NULL to a list, use [ and list(NULL).

x <- list(a = 1, b = 2)
x[["b"]] <- NULL
str(x)
## List of 1
##  $ a: num 1
x <- list(a = 1, b = 2)
x["b"] <- list(NULL)
str(x)
## List of 2
##  $ a: num 1
##  $ b: NULL
y <- list(a = 1)
y["b"] <- list(NULL)
str(y)
## List of 2
##  $ a: num 1
##  $ b: NULL

Removing NA values

 x <- c(1, 2, NA, 4, NA, 5) 
 bad <- is.na(x) 
 x = x[!bad]
dim(airquality)
## [1] 153   6
airquality[1:6, ]
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
good <- complete.cases(airquality)
airquality[good, ][1:6, ]
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 7    23     299  8.6   65     5   7
## 8    19      99 13.8   59     5   8
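
complete.cases also accepts several vectors at once, which helps when two parallel vectors must stay aligned. A minimal added sketch, with x and y made up for the example:

```r
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f")

good <- complete.cases(x, y)   # TRUE where neither x nor y is NA
x[good]
y[good]
```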

Apply family functions

  • lapply: Loop over a list and evaluate a function on each element
  • sapply: Same as lapply but try to simplify the result
  • apply: Apply a function over the margins of an array
  • tapply: Apply a function over subsets of a vector
  • mapply: Multivariate version of lapply

An auxiliary function split is also useful, particularly in conjunction with lapply.

lapply

lapply takes three arguments: (1) a list X; (2) a function (or the name of a function) FUN; (3) other arguments via its … argument. If X is not a list, it will be coerced to a list using as.list.

lapply always returns a list, regardless of the class of the input.

x <- list(a = 1:5, b = rnorm(10)) 
lapply(x, mean)
## $a
## [1] 3
## 
## $b
## [1] 0.3864866
x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
lapply(x, function(elt) elt[,1])
## $a
## [1] 1 2
## 
## $b
## [1] 1 2 3
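
The third argument, …, is how extra arguments reach FUN. The sketch below is an added example with made-up data: it passes na.rm = TRUE through to mean, and also shows a non-list input being coerced while the result stays a list.

```r
x <- list(a = c(1, NA, 3), b = c(4, 5))

lapply(x, mean)                 # a is NA because of the missing value
lapply(x, mean, na.rm = TRUE)   # na.rm = TRUE is handed on to mean()

# A non-list input is coerced with as.list(); the result is still a list
sq <- lapply(1:3, function(i) i^2)
class(sq)
```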

sapply

sapply will try to simplify the result of lapply if possible.

  • If the result is a list where every element is length 1, then a vector is returned
  • If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
  • If it can’t figure things out, a list is returned
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean) 
## $a
## [1] 2.5
## 
## $b
## [1] -0.2139658
## 
## $c
## [1] 0.6853578
## 
## $d
## [1] 4.839964
sapply(x, mean)
##          a          b          c          d 
##  2.5000000 -0.2139658  0.6853578  4.8399644
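
The three simplification rules can be seen one after another. This is an added sketch with deterministic inputs; the names v, m and l are hypothetical.

```r
x <- list(a = 1:4, b = 5:8)

v <- sapply(x, mean)     # every result has length 1 -> named vector
m <- sapply(x, range)    # every result has length 2 -> matrix, one column per element
l <- sapply(list(a = 1:2, b = 1:3), function(e) e[e > 1])  # lengths differ -> list

v
m
l
```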

apply

  • It is most often used to apply a function to the rows or columns of a matrix
  • It can be used with general arrays, e.g. taking the average of an array of matrices
  • It is not really faster than writing a loop, but it works in one line!

str(apply) 
## function (X, MARGIN, FUN, ...)

  • X is an array
  • MARGIN is an integer vector indicating which margins should be “retained”
  • FUN is a function to be applied
  • … is for other arguments to be passed to FUN

x <- matrix(1:9, 3, 3)
apply(x, 2, mean)
## [1] 2 5 8
apply(x, 1, mean) 
## [1] 4 5 6
# 1 is for rows and 2 is for columns
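
A few more calls may help fix the meaning of MARGIN and of the … argument. This is an added sketch: rowMeans and quantile are base R functions that do not appear in the notes above and are brought in only for comparison.

```r
x <- matrix(1:9, 3, 3)

apply(x, 1, mean)    # MARGIN = 1: one mean per row
rowMeans(x)          # same values from the dedicated base R function

apply(x, 2, range)   # FUN returns two values, so the result is a 2 x 3 matrix
apply(x, 1, quantile, probs = 0.5)   # probs is passed on to quantile()
```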