This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
library(ggplot2)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point()
version
## _
## platform x86_64-w64-mingw32
## arch x86_64
## os mingw32
## system x86_64, mingw32
## status
## major 3
## minor 5.1
## year 2018
## month 07
## day 02
## svn rev 74947
## language R
## version.string R version 3.5.1 (2018-07-02)
## nickname Feather Spray
available.packages()
library(caret)
## Loading required package: lattice
search()
## [1] ".GlobalEnv" "package:caret" "package:lattice"
## [4] "package:ggplot2" "package:stats" "package:graphics"
## [7] "package:grDevices" "package:utils" "package:datasets"
## [10] "package:methods" "Autoloads" "package:base"
ls()
library(help = "caret")
??caret
??grid
??plot
?plot
?caret
apropos("createdata")
## [1] "createDataPartition"
getwd()
## [1] "C:/Users/preygupta/Documents/Predictive Analytics with R"
setwd("C:/Users/preygupta/Documents/spls")
setwd("C:/Users/preygupta/Documents/Predictive Analytics with R")
b = c(1:4)
typeof(b)
## [1] "integer"
d = c(1,2,3,4)
typeof(d)
## [1] "double"
c = as.integer(c(1,5,7))
typeof(c)
## [1] "integer"
a <- c(1L,2L,3L,4L)
typeof(a)
## [1] "integer"
| | Homogeneous | Heterogeneous |
|---|---|---|
| 1d | Atomic vector | List |
| 2d | Matrix | Data frame |
| nd | Array | |
* An atomic vector is the basic data structure of R; all of its elements have the same type.
* is.vector() does not test if an object is a vector. Instead it returns TRUE only if the object is a vector with no attributes apart from names. Use is.atomic(x) || is.list(x) to test if an object is actually a vector.
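The is.vector() caveat above is easy to see with a small sketch. The attribute name "note" is arbitrary and chosen only for illustration:

```r
# A plain integer vector has no attributes at all
x <- 1:3
is.vector(x)                 # TRUE

# Add an attribute other than names
attr(x, "note") <- "still a vector"
is.vector(x)                 # FALSE, because of the extra attribute
is.atomic(x) || is.list(x)   # TRUE, the more reliable test
```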
v = vector()
str(v)
## logi(0)
typeof(v)
## [1] "logical"
v = c()
str(v)
## NULL
typeof(v)
## [1] "NULL"
v = vector(NULL)
## Error in vector(NULL): invalid 'mode' argument
str(v)
## NULL
typeof(v)
## [1] "NULL"
mode() returns a character string giving the mode of an object, and storage.mode() gives its storage mode. The two are often the same, and both rely on the output of typeof(x).
Modes have the same set of names as types (see typeof), except that:
* types "integer" and "double" are returned as "numeric".
* types "special" and "builtin" are returned as "function".
* type "symbol" is called mode "name".
* type "language" is returned as "(" or "call".
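A short sketch of the mapping between types and modes described above, using only base R objects:

```r
# typeof() reports the internal type; mode() groups some types together
typeof(1L)             # "integer"
mode(1L)               # "numeric"
typeof(1.5)            # "double"
mode(1.5)              # "numeric"
typeof(sum)            # "builtin"
mode(sum)              # "function"
typeof(quote(x))       # "symbol"
mode(quote(x))         # "name"
typeof(quote(x + 1))   # "language"
mode(quote(x + 1))     # "call"
```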
v = vector(NA)
## Error in vector(NA): vector: cannot make a vector of mode 'NA'.
str(v)
## NULL
typeof(v)
## [1] "NULL"
v = NA
str(v)
## logi NA
typeof(v)
## [1] "logical"
v = NULL
str(v)
## NULL
typeof(v)
## [1] "NULL"
typeof(NA_integer_)
## [1] "integer"
str(NA_integer_)
## int NA
NULL represents the null object in R: it is a reserved word. NULL is often returned by expressions and functions whose values are undefined. NULL is a special object. It is used whenever there is a need to indicate or specify that an object is absent. It should not be confused with a vector or list of zero length.
The NULL object has no type and no modifiable properties. There is only one NULL object in R, to which all instances refer. To test for NULL use is.null. You cannot set attributes on NULL.
NA is a logical constant of length 1 which contains a missing value indicator. NA can be freely coerced to any other vector type except raw. There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ of the other atomic vector types which support missing values: all of these are reserved words in the R language.
| Expression | x is NA | x is NULL |
|---|---|---|
| x | [1] NA | NULL |
| class(x) | [1] "logical" | [1] "NULL" |
| x > 1 | [1] NA | logical(0) |
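One way to keep the two apart is to compare their lengths and the tests that apply to each. A minimal sketch in base R:

```r
length(NA)      # 1: NA is a value that happens to be missing
length(NULL)    # 0: NULL is the absence of an object
is.na(NA)       # TRUE
is.null(NULL)   # TRUE
is.null(NA)     # FALSE

# Each atomic type except raw has its own typed NA
typeof(NA_character_)   # "character"
typeof(c(1.5, NA))      # "double": NA is coerced to the type of its neighbours
```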
v <- c( 1, NA, NULL) # A vector ignores NULL
v
## [1] 1 NA
list(1, NA, NULL) # list does not ignore NULL
## [[1]]
## [1] 1
##
## [[2]]
## [1] NA
##
## [[3]]
## NULL
# ask question here
#https://www.r-bloggers.com/r-na-vs-null/
typeof(1)
## [1] "double"
typeof(1L)
## [1] "integer"
typeof(Inf)
## [1] "double"
class(1)
## [1] "numeric"
class(1L)
## [1] "integer"
class(Inf)
## [1] "numeric"
#https://stackoverflow.com/questions/35445112/what-is-the-difference-between-mode-and-class-in-r
By default, most attributes are lost when a vector is modified. The only attributes not lost are the three most important:
* Names, a character vector giving each element a name. Syntax: names(x)
* Dimensions, used to turn vectors into matrices and arrays. Syntax: dim(x)
* Class, used to implement the S3 object system. Syntax: class(x)
y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
attr(y, "my_attribute")
## [1] "This is a vector"
attributes(y)
## $my_attribute
## [1] "This is a vector"
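To see a custom attribute being lost, the following sketch repeats the assignment above and then modifies the vector. The vector `named` is an extra illustration that is not part of the original notes:

```r
y <- 1:10
attr(y, "my_attribute") <- "This is a vector"

attributes(y[1])     # NULL: subsetting drops the custom attribute
attributes(sum(y))   # NULL: so does summarising

# Names, by contrast, survive subsetting
named <- c(a = 1, b = 2, c = 3)
names(named[2:3])    # "b" "c"
```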
e = c(3,9,4,6,7)
names(e) = c("q","w","e")
e[1]
## q
## 3
e["q"]
## q
## 3
attributes(e)
## $names
## [1] "q" "w" "e" NA NA
dim(e) #vectors are unidimensional hence the dimension is NULL
## NULL
x <- c(a = 1, b = 2, c = 3)
print(x)
## a b c
## 1 2 3
x <- setNames(1:3, c("a", "b", "c"))
print(x)
## a b c
## 1 2 3
# You can create a new vector without names using unname(x), or remove names in place with names(x) <- NULL.
unname(x)
## [1] 1 2 3
names(e) = NULL
print(e)
## [1] 3 9 4 6 7
k = NA
is.na(k)
## [1] TRUE
a= 0/0
a
## [1] NaN
s = c("a", "b", a)
is.na(a)
## [1] TRUE
is.na(s)
## [1] FALSE FALSE FALSE
is.nan(a)
## [1] TRUE
which(s == "NaN")
## [1] 3
#K == NA
#Questions and notes:-
c(1, c(2, c(3, 4)))
## [1] 1 2 3 4
####Given a vector, you can determine its type with typeof(), or check if it’s a specific type with an “is” function: is.character(), is.double(), is.integer(), is.logical(), or, more generally, is.atomic().
int_var <- c(1L, 6L, 10L)
typeof(int_var)
## [1] "integer"
is.integer(int_var)
## [1] TRUE
is.atomic(int_var)
## [1] TRUE
#### is.numeric() is a general test for the “numberliness” of a vector and returns TRUE for both integer and double vectors.
is.numeric(int_var)
## [1] TRUE
One important use of attributes is to define factors. A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class, “factor”, which makes them behave differently from regular integer vectors, and the levels, which defines the set of allowed values.
x <- factor(c("a", "b", "b", "a"))
x
## [1] a b b a
## Levels: a b
class(x)
## [1] "factor"
typeof(x)
## [1] "integer"
levels(x)
## [1] "a" "b"
class(x[1])
## [1] "factor"
typeof(x[1])
## [1] "integer"
x[1]
## [1] a
## Levels: a b
# While factors look (and often behave) like character vectors, they are actually integers. Be careful when treating them like strings. Some string methods (like gsub() and grepl()) will coerce factors to strings, while others (like nchar()) will throw an error, and still others (like c()) will use the underlying integer values. For this reason, it’s usually best to explicitly convert factors to character vectors if you need string-like behaviour.
# You can't use values that are not in the levels
x[2] <- "c"
## Warning in `[<-.factor`(`*tmp*`, 2, value = "c"): invalid factor level, NA
## generated
print(x)
## [1] a <NA> b a
## Levels: a b
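# NB: in the R version used here (3.5.1) you can't combine factors with c(); it returns the underlying integer codes. From R 4.1.0 onwards, c() combines factors and their levels.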
c(factor("a"), factor("b"))
## [1] 1 1
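If two factors need to be combined, one approach that does not depend on how c() treats factors is to go through character vectors. This is a sketch; the names f1, f2 and combined are illustrative:

```r
f1 <- factor(c("a", "b"))
f2 <- factor(c("b", "c"))

# Convert to character first, then rebuild the factor so that the
# levels are the union of both inputs
combined <- factor(c(as.character(f1), as.character(f2)))
combined
levels(combined)   # "a" "b" "c"
```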
Factors are useful when you know the possible values a variable may take, even if you don’t see all values in a given dataset. Using a factor instead of a character vector makes it obvious when some groups contain no observations:
sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))
table(sex_char)
## sex_char
## m
## 3
table(sex_factor)
## sex_factor
## m f
## 3 0
Sometimes when a data frame is read directly from a file, a column you’d thought would produce a numeric vector instead produces a factor. This is caused by a non-numeric value in the column, often a missing value encoded in a special way like . or -. To remedy the situation, coerce the vector from a factor to a character vector, and then from a character to a double vector. (Be sure to check for missing values after this process.) Of course, a much better plan is to discover what caused the problem in the first place and fix that; using the na.strings argument to read.csv() is often a good place to start.
z <- read.csv(text = "value\n12\n1\n.\n9\n1\n9")
z
## value
## 1 12
## 2 1
## 3 .
## 4 9
## 5 1
## 6 9
class(z)
## [1] "data.frame"
typeof(z)
## [1] "list"
class(z$value)
## [1] "factor"
typeof(z$value)
## [1] "integer"
levels(z$value)
## [1] "." "1" "12" "9"
as.character(z$value)
## [1] "12" "1" "." "9" "1" "9"
as.double(z$value) # absurd value
## [1] 3 2 1 4 2 4
as.double(as.character(z$value))
## Warning: NAs introduced by coercion
## [1] 12 1 NA 9 1 9
z$value = as.double(as.character(z$value))
## Warning: NAs introduced by coercion
print(z)
## value
## 1 12
## 2 1
## 3 NA
## 4 9
## 5 1
## 6 9
class(z$value)
## [1] "numeric"
z <- read.csv(text = "value\n12\n1\n.\n9\n1\n9", na.strings=".")
typeof(z$value)
## [1] "integer"
class(z$value)
## [1] "integer"
print(z)
## value
## 1 12
## 2 1
## 3 NA
## 4 9
## 5 1
## 6 9
Unfortunately, in R versions before 4.0.0, most data loading functions automatically convert character vectors to factors. This is suboptimal, because there's no way for those functions to know the set of all possible levels or their optimal order. Instead, use the argument stringsAsFactors = FALSE to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the data.
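A minimal sketch of that workflow follows. The column name and its values are made up for illustration and inlined through the text argument so that no file is needed:

```r
# Hypothetical data, read as plain character strings
raw <- read.csv(text = "size\nsmall\nlarge\nsmall", stringsAsFactors = FALSE)
class(raw$size)    # "character"

# Declare the full set of levels, in a meaningful order, by hand
raw$size <- factor(raw$size, levels = c("small", "medium", "large"))
levels(raw$size)   # "small" "medium" "large"
table(raw$size)    # the unused level "medium" is reported with a count of 0
```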
Manipulating levels in a factor.
f3 <- factor(letters, levels = rev(letters))
f2 <- rev(factor(letters))
f1 <- factor(letters) # f1 must exist before its levels can be changed
levels(f1) = c(levels(f1), "qq") # add a new, unused level
levels(f1) = c(levels(f1), 1) # the number 1 is coerced to the level "1"
addNoAnswer <- function(x){
  # Add a "No Answer" level to factors; return other objects unchanged
  if(is.factor(x)) return(factor(x, levels=c(levels(x), "No Answer")))
  return(x)
}
# df must be a data frame before this call; the one below is a made-up example
df <- data.frame(q1 = factor(c("yes", "no")), q2 = 1:2)
df <- as.data.frame(lapply(df, addNoAnswer))
levels(df$q1)
## [1] "no" "yes" "No Answer"
#### A list is an object that can contain elements of different types, such as strings, numbers, vectors, and even other lists. A list can also contain a matrix or a function as an element. Lists are created with the list() function. In other words, a list is a generic vector containing other objects.
Lists are sometimes called recursive vectors, because a list can contain other lists. This makes them fundamentally different from atomic vectors. Atomic vectors are flat.
x <- list(list(list(list())))
str(x)
## List of 1
## $ :List of 1
## ..$ :List of 1
## .. ..$ : list()
is.recursive(x)
## [1] TRUE
x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))
str(x)
## List of 2
## $ :List of 2
## ..$ : num 1
## ..$ : num 2
## $ : num [1:2] 3 4
str(y)
## List of 4
## $ : num 1
## $ : num 2
## $ : num 3
## $ : num 4
class(y)
## [1] "list"
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
typeof(x)
## [1] "list"
colnames(x)
## NULL
y = unlist(x)
y
## [1] "1" "2" "3" "a" "TRUE" "FALSE" "TRUE" "2.3" "5.9"
typeof(y)
## [1] "character"
y = as.vector(x)
y
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE FALSE TRUE
##
## [[4]]
## [1] 2.3 5.9
typeof(y)
## [1] "list"
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE )
x = list( n, s, b, 3) # x contains copies of n, s, b
#Naming List
list_data <- list(c("Feb","Mar","Apr"), matrix(c(3,9,5,1,-2,8), nrow = 2), list("green",12.3))
print(list_data)
## [[1]]
## [1] "Feb" "Mar" "Apr"
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
##
## [[3]]
## [[3]][[1]]
## [1] "green"
##
## [[3]][[2]]
## [1] 12.3
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
print(list_data)
## $`1st Quarter`
## [1] "Feb" "Mar" "Apr"
##
## $A_Matrix
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
##
## $`A Inner list`
## $`A Inner list`[[1]]
## [1] "green"
##
## $`A Inner list`[[2]]
## [1] 12.3
list_data$A_Matrix
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
list_data$"A_Matrix"
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
list_data$'A_Matrix'
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
list_data[[2]]
## [,1] [,2] [,3]
## [1,] 3 5 -2
## [2,] 9 1 8
Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array. A special case of the array is the matrix, which has two dimensions. Matrices are used commonly as part of the mathematical machinery of statistics. Values are filled in column by column.
m = c(1:24)
print(m)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24
dim(m) = c(5,5)
## Error in dim(m) = c(5, 5): dims [product 25] do not match the length of object [24]
dim(m) = c(6,4)
m
## [,1] [,2] [,3] [,4]
## [1,] 1 7 13 19
## [2,] 2 8 14 20
## [3,] 3 9 15 21
## [4,] 4 10 16 22
## [5,] 5 11 17 23
## [6,] 6 12 18 24
class(m)
## [1] "matrix"
typeof(m)
## [1] "integer"
dim(m) = c(2,3,4)
m
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 13 15 17
## [2,] 14 16 18
##
## , , 4
##
## [,1] [,2] [,3]
## [1,] 19 21 23
## [2,] 20 22 24
class(m)
## [1] "array"
typeof(m)
## [1] "integer"
# Two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
print(a)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
ncol(a)
## [1] 3
nrow(a)
## [1] 2
dim(a)
## [1] 2 3
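matrix() fills column by column unless told otherwise. As a small sketch, the byrow argument of matrix() switches to row-wise filling; the name by_row is illustrative:

```r
# Default: fill column by column
matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE: fill row by row instead
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```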
# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))
print(b)
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
ncol(b)
## [1] 3
nrow(b)
## [1] 2
dim(b)
## [1] 2 3 2
rownames(a) <- c("A", "B")
colnames(a) <- c("a", "b", "c")
a
## a b c
## A 1 3 5
## B 2 4 6
dimnames(b) <- list(c("one", "two"), c("a", "b", "c"), c("A", "B"))
b
## , , A
##
## a b c
## one 1 3 5
## two 2 4 6
##
## , , B
##
## a b c
## one 7 9 11
## two 8 10 12
Vectors are not the only 1-dimensional data structure. You can have matrices with a single row or single column, or arrays with a single dimension. They may print similarly, but will behave differently. The differences aren't too important, but it's useful to know they exist in case you get strange output from a function (tapply() is a frequent offender). As always, use str() to reveal the differences.
q = matrix(1:3, ncol = 1)
print(q)
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
q = array(1:3, c(3))
print(q)
## [1] 1 2 3
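Since print() hides the difference between these structures, the sketch below applies str() to each of them:

```r
str(1:3)                     # int [1:3] 1 2 3
str(matrix(1:3, ncol = 1))   # int [1:3, 1] 1 2 3
str(matrix(1:3, nrow = 1))   # int [1, 1:3] 1 2 3
str(array(1:3, 3))           # int [1:3(1d)] 1 2 3
```

The dim attribute is what differs: it is absent for the plain vector, has two entries for the matrices, and a single entry for the one-dimensional array.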
A data frame is the most commonly used structure for storing data in R. It is a list of equal-length vectors, which makes it a 2-dimensional structure that shares properties of both the matrix and the list.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
df
## x y
## 1 1 a
## 2 2 b
## 3 3 c
str(df)
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: Factor w/ 3 levels "a","b","c": 1 2 3
Beware data.frame()'s default behaviour (in R versions before 4.0.0), which turns strings into factors. Use stringsAsFactors = FALSE to suppress this behaviour:
df <- data.frame(
x = 1:3,
y = c("a", "b", "c"),
stringsAsFactors = FALSE)
str(df)
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: chr "a" "b" "c"
typeof(df)
## [1] "list"
class(df)
## [1] "data.frame"
is.data.frame(df)
## [1] TRUE
You can coerce an object to a data frame with as.data.frame():
v = c(1:5)
as.data.frame(v)
## v
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
l = list(a = 1:3,b= c("a","b"))
print(l)
## $a
## [1] 1 2 3
##
## $b
## [1] "a" "b"
as.data.frame(l)
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 3, 2
l = list(a = 1:2,b= c("a","b"))
as.data.frame(l)
## a b
## 1 1 a
## 2 2 b
print(l)
## $a
## [1] 1 2
##
## $b
## [1] "a" "b"
You can combine data frames using cbind() and rbind():
cbind(df, data.frame(z = 3:1))
## x y z
## 1 1 a 3
## 2 2 b 2
## 3 3 c 1
cbind(df, data.frame(z = 3:2)) # cbinding with unequal lengths
## Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2
library(rowr)
cbind.fill(df, data.frame(z = 3:2), fill = NA)
## x y z
## 1 1 a 3
## 2 2 b 2
## 3 3 c NA
dplyr::full_join(df, data.frame(x = 4:5))
## Joining, by = "x"
## x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 <NA>
## 5 5 <NA>
rbind(df, data.frame(10,"z"))
## Error in match.names(clabs, names(xi)): names do not match previous names
rbind(df, data.frame(x = 10, y = "z"))
## x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 10 z
plyr::rbind.fill(df, data.frame(10)) # rbinding with unequal lengths
## x y X10
## 1 1 a NA
## 2 2 b NA
## 3 3 c NA
## 4 NA <NA> 10
plyr::rbind.fill(df, data.frame(x = 20))
## x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 20 <NA>
# When combining column-wise, the number of rows must match, but row names are ignored. When combining row-wise, both the number and names of columns must match. Use plyr::rbind.fill() to combine
# data frames that don’t have the same columns
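It's a common mistake to try to create a data frame by cbind()ing vectors together. This doesn't work because cbind() creates a matrix unless one of the arguments is already a data frame. Instead, use data.frame() directly: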
good <- data.frame(a = 1:2, b = c("a", "b"),
stringsAsFactors = FALSE)
str(good)
## 'data.frame': 2 obs. of 2 variables:
## $ a: int 1 2
## $ b: chr "a" "b"
good
## a b
## 1 1 a
## 2 2 b
bad <- data.frame(cbind(a = 1:2, b = c("a", "b")))
str(bad)
## 'data.frame': 2 obs. of 2 variables:
## $ a: Factor w/ 2 levels "1","2": 1 2
## $ b: Factor w/ 2 levels "a","b": 1 2
The conversion rules for cbind() are complicated and best avoided by ensuring all inputs are of the same type.
Since a data frame is a list of vectors, it is possible for a data frame to have a column that is a list:
df <- data.frame(x = 1:3)
df$y = c(4:6)
df[,3]=data.frame(z = c(7:9))
df$y <- list(1:2, 1:3, 1:4)
df
## x y z
## 1 1 1, 2 7
## 2 2 1, 2, 3 8
## 3 3 1, 2, 3, 4 9
str(df)
## 'data.frame': 3 obs. of 3 variables:
## $ x: int 1 2 3
## $ y:List of 3
## ..$ : int 1 2
## ..$ : int 1 2 3
## ..$ : int 1 2 3 4
## $ z: int 7 8 9
dfn = data.frame(x = 1:3, y = list(1:2, 1:3, 1:4))
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 2, 3, 4
dfn = data.frame(x = 1:3, y = list(1:3))
str(dfn)
## 'data.frame': 3 obs. of 2 variables:
## $ x : int 1 2 3
## $ X1.3: int 1 2 3
A workaround is to use I(), which causes data.frame() to treat the list as one unit:
dfl <- data.frame(x = 1:3, y = I(list(1:2, 1:3, 1:4)))
str(dfl)
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y:List of 3
## ..$ : int 1 2
## ..$ : int 1 2 3
## ..$ : int 1 2 3 4
## ..- attr(*, "class")= chr "AsIs"
dfl[3, "y"]
## [[1]]
## [1] 1 2 3 4
dfm <- data.frame(x = 1:3, y = I(matrix(1:9, nrow = 3)))
str(dfm)
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: 'AsIs' int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
dfm
## x y.1 y.2 y.3
## 1 1 1 4 7
## 2 2 2 5 8
## 3 3 3 6 9
dfm[2, "y"]
## [,1] [,2] [,3]
## [1,] 2 5 8
Use list and array columns with caution: many functions that work with data frames assume that all columns are atomic vectors.
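When read.table() is called without extra arguments, R will automatically:
* skip lines that begin with a #
* work out how many rows there are (and how much memory needs to be allocated)
* work out what type of variable is in each column of the table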
Note: read.xlsx("filename.xlsx", 1) reads your file and makes the data.frame column classes nearly useful, but is very slow for large data sets.
inputData = read.table(file = "C:/Users/preygupta/Documents/Predictive Analytics with R/train.csv", header = TRUE, sep = "," )
head(inputData)
## X.raw.error X X.1 X.2 X.3 X.4 X.5
## 1 ID datetime temperature var1 pressure windspeed var2
## 2 0 7/1/2013 0:00 0 0 0 571.91 A
## 3 1 7/1/2013 1:00 -12.1 -19.3 996 575.04 A
## 4 2 7/1/2013 2:00 -12.9 -20 1000 578.435 A
## 5 3 7/1/2013 3:00 -11.4 -17.1 995 582.58 A
## 6 4 7/1/2013 4:00 -11.4 -19.3 1005 586.6 A
## X.6
## 1 electricity_consumption
## 2 216
## 3 210
## 4 225
## 5 216
## 6 222
str(inputData$var2)
## NULL
levels(inputData$var2)
## NULL
inputData = read.table(file = "C:/Users/preygupta/Documents/Predictive Analytics with R/train.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE )
levels(inputData$var2)
## NULL
str(inputData$var2)
## NULL
system.time(read.table(file = "C:/Users/preygupta/Documents/Predictive Analytics with R/train.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE ))
## user system elapsed
## 0.07 0.00 0.07
system.time(read.csv2(file = "C:/Users/preygupta/Documents/Predictive Analytics with R/train.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE ))
## user system elapsed
## 0.03 0.01 0.05
inputData = read.table(file = "C:/Users/preygupta/Documents/Predictive Analytics with R/train.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE, comment.char = "^", nrows = 10 )
ch = c("a","d")
dump(list = c("ch","t", "inputData"), file = "test1.R", append = FALSE, envir = parent.frame(), evaluate = TRUE)
# remove ch from the environment, then run source() to restore it
source("test1.R")
y <- data.frame(a = 1, b = "a")
dput(y)
## structure(list(a = 1, b = structure(1L, .Label = "a", class = "factor")), class = "data.frame", row.names = c(NA,
## -1L))
Data are read in using connection interfaces. Connections can be made to files (most common) or to other more exotic things.
con <- file("C:/Users/preygupta/Documents/Predictive Analytics with R/train.csv", "r")
data <- read.csv(con)
head(data)
## X.raw.error X X.1 X.2 X.3 X.4 X.5
## 1 ID datetime temperature var1 pressure windspeed var2
## 2 0 7/1/2013 0:00 0 0 0 571.91 A
## 3 1 7/1/2013 1:00 -12.1 -19.3 996 575.04 A
## 4 2 7/1/2013 2:00 -12.9 -20 1000 578.435 A
## 5 3 7/1/2013 3:00 -11.4 -17.1 995 582.58 A
## 6 4 7/1/2013 4:00 -11.4 -19.3 1005 586.6 A
## X.6
## 1 electricity_consumption
## 2 216
## 3 210
## 4 225
## 5 216
## 6 222
str(data)
## 'data.frame': 26497 obs. of 8 variables:
## $ X.raw.error: Factor w/ 26497 levels "0","1","10","100",..: 26497 1 2 8425 16656 21439 22326 23109 23928 24675 ...
## $ X : Factor w/ 26497 levels "1/1/2014 0:00",..: 26497 19873 19874 19885 19890 19891 19892 19893 19894 19895 ...
## $ X.1 : Factor w/ 61 levels "-0.7","-1.4",..: 61 25 6 7 5 5 4 8 5 4 ...
## $ X.2 : Factor w/ 72 levels "-0.7","-1.4",..: 72 45 16 19 13 16 16 13 14 15 ...
## $ X.3 : Factor w/ 75 levels "0","1000","1001",..: 75 1 71 2 70 7 15 8 72 14 ...
## $ X.4 : Factor w/ 5604 levels "1.075","1.2",..: 5604 4357 4358 4359 4406 4407 1482 3337 4880 393 ...
## $ X.5 : Factor w/ 4 levels "A","B","C","var2": 4 1 1 1 1 1 1 1 1 1 ...
## $ X.6 : Factor w/ 253 levels "1002","1059",..: 253 30 28 33 30 32 30 31 32 31 ...
close(con)
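The open, read, close pattern can be tried without any particular file on disk. The sketch below writes a tiny CSV to a temporary file first; the file contents and the names path, first_two and remaining are made up for illustration:

```r
# Write a small CSV to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10", "2,20", "3,30"), path)

con <- file(path, "r")               # open the connection for reading
first_two <- readLines(con, n = 2)   # read only the first two lines
remaining <- readLines(con)          # continue from where the last read stopped
close(con)                           # always close what you open

first_two   # "id,value" "1,10"
remaining   # "2,20" "3,30"
unlink(path)
```

Because the connection stays open between calls, each read picks up where the previous one ended, which is useful for processing a large file in pieces.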
readLines can be useful for reading in lines of webpages
## This might take time
con <- url("http://www.jhsph.edu", "r")
x <- readLines(con)
head(x)
## [1] "<!DOCTYPE html>"
## [2] "<html lang=\"en\">"
## [3] ""
## [4] "<head>"
## [5] "<meta charset=\"utf-8\" />"
## [6] "<title>Johns Hopkins Bloomberg School of Public Health</title>"
R's subsetting operators are powerful and fast. Mastery of subsetting allows you to succinctly express complex operations in a way that few other languages can match.
There are a number of operators that can be used to extract subsets of R objects.
x <- c(2.1, 4.2, 3.3, 5.4)
# Positive integers return elements at the specified positions:
x[c(3, 1)]
## [1] 3.3 2.1
x[order(x)]
## [1] 2.1 3.3 4.2 5.4
# Real numbers are silently truncated to integers
x[c(2.1, 2.9)]
## [1] 4.2 4.2
# Negative integers omit elements at the specified positions:
x[-c(3, 1)]
## [1] 4.2 5.4
### You can’t mix positive and negative integers in a single subset.
x[c(-1, 2)]
## Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts
# Logical vectors select elements where the corresponding logical value is TRUE.
x[c(TRUE, TRUE, FALSE, FALSE)]
## [1] 2.1 4.2
x[x > 3]
## [1] 4.2 3.3 5.4
### If the logical vector is shorter than the vector being subsetted, it will be recycled to be the same length.
x[c(TRUE, FALSE)]
## [1] 2.1 3.3
x[c(TRUE, FALSE, TRUE, FALSE)]
## [1] 2.1 3.3
### A missing value in the index always yields a missing value in the output
x[c(TRUE, TRUE, NA, FALSE)]
## [1] 2.1 4.2 NA
# Nothing returns the original vector. This is not useful for vectors but is very useful for matrices, data frames, and arrays.
x[]
## [1] 2.1 4.2 3.3 5.4
# Character vectors
y = setNames(x, letters[1:4])
y[c("d", "c", "a")]
## d c a
## 5.4 3.3 2.1
############ Quick notes. Useful tips.
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
LETTERS
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
month.abb
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
month.name
## [1] "January" "February" "March" "April" "May"
## [6] "June" "July" "August" "September" "October"
## [11] "November" "December"
############
# When subsetting with [ names are always matched exactly
z <- c(abc = 1, def = 2)
z[c("a", "d")]
## <NA> <NA>
## NA NA
Subsetting a list works in the same way as subsetting an atomic vector. Using [ will always return a list; [[ and $, will be discussed in detail soon.
a <- matrix(10:18, nrow = 3)
colnames(a) = c("A", "B", "C")
print(a)
## A B C
## [1,] 10 13 16
## [2,] 11 14 17
## [3,] 12 15 18
a[1:2, ]
## A B C
## [1,] 10 13 16
## [2,] 11 14 17
class(a[1:2, ]) # the first index selects rows, the second selects columns: [row, column]
## [1] "matrix"
a[c(4,7)] # a single index vector counts positions in column-major order
## [1] 13 16
a[c(T, F, T), c("B", "A")]
## B A
## [1,] 13 10
## [2,] 15 12
a[0,]
## A B C
class(a[0,-2])
## [1] "matrix"
class(a[0,]) # [ simplifies by dropping dimensions of length 1; a zero-row selection still has two dimensions, so it stays a matrix
## [1] "matrix"
Because matrices and arrays are implemented as vectors with special attributes, you can subset them with a single vector. In that case, they will behave like a vector. Arrays and Matrices in R are stored in column-major order:
(vals <- outer(1:5, 1:5, FUN = "paste", sep = ","))
## [,1] [,2] [,3] [,4] [,5]
## [1,] "1,1" "1,2" "1,3" "1,4" "1,5"
## [2,] "2,1" "2,2" "2,3" "2,4" "2,5"
## [3,] "3,1" "3,2" "3,3" "3,4" "3,5"
## [4,] "4,1" "4,2" "4,3" "4,4" "4,5"
## [5,] "5,1" "5,2" "5,3" "5,4" "5,5"
class(vals)
## [1] "matrix"
vals[c(4,15)] # accessing a 2-d matrix using single index
## [1] "4,1" "5,3"
vals[15]
## [1] "5,3"
select <- matrix(ncol = 2, byrow = TRUE, c( # index a matrix with a matrix of indices; the same works for arrays
1, 1,
3, 1,
2, 4
))
vals[select]
## [1] "1,1" "3,1" "2,4"
vals[[2,2]]
## [1] "2,2"
class(vals[[2,2]])
## [1] "character"
Data frames possess the characteristics of both lists and matrices: if you subset with a single vector, they behave like lists; if you subset with two vectors, they behave like matrices.
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
str(df)
## 'data.frame': 3 obs. of 3 variables:
## $ x: int 1 2 3
## $ y: int 3 2 1
## $ z: Factor w/ 3 levels "a","b","c": 1 2 3
df$x == 2
## [1] FALSE TRUE FALSE
df[df$x == 2, ]
## x y z
## 2 2 2 b
df[c(1, 3), ]
## x y z
## 1 1 3 a
## 3 3 1 c
# There are two ways to select columns from a data frame
# Like a list(same as a vector):
df[c("x", "z")]
## x z
## 1 1 a
## 2 2 b
## 3 3 c
str(df[c("x", "z")])
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ z: Factor w/ 3 levels "a","b","c": 1 2 3
# Like a matrix
df[, c("x", "z")]
## x z
## 1 1 a
## 2 2 b
## 3 3 c
str(df[, c("x", "z")])
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ z: Factor w/ 3 levels "a","b","c": 1 2 3
df["x"]
## x
## 1 1
## 2 2
## 3 3
str(df["x"])
## 'data.frame': 3 obs. of 1 variable:
## $ x: int 1 2 3
df[,"x"]
## [1] 1 2 3
str(df[,"x"])
## int [1:3] 1 2 3
# There's an important difference if you select a single
# column: matrix subsetting simplifies by default, list
# subsetting does not.
There are two other subsetting operators, [[ and $, which are most useful for lists and data frames. [[ is similar to [, except it can only return a single value and it allows you to pull pieces out of a list. $ is a useful shorthand for [[ combined with character subsetting.
You need [[ when working with lists. This is because when [ is applied to a list it always returns a list: it never gives you the contents of the list. To get the contents, you need [[.
Because it can return only a single value, you must use [[ with either a single positive integer or a string. If you supply a vector instead, [[ indexes recursively.
x <- list(foo = 1:4, bar = 0:6)
x[1]
## $foo
## [1] 1 2 3 4
class(x[1])
## [1] "list"
x[[1]]
## [1] 1 2 3 4
class(x[[1]])
## [1] "integer"
x$foo
## [1] 1 2 3 4
class(x$foo)
## [1] "integer"
x$"foo"
## [1] 1 2 3 4
class(x$"foo")
## [1] "integer"
x$'foo'
## [1] 1 2 3 4
class(x$'foo')
## [1] "integer"
x[[2]][c(1:3)]
## [1] 0 1 2
x$foo[3]
## [1] 3
x$bar[c(3:5)]
## [1] 2 3 4
x["bar"]
## $bar
## [1] 0 1 2 3 4 5 6
class(x["bar"])
## [1] "list"
x[["bar"]]
## [1] 0 1 2 3 4 5 6
class(x[["bar"]])
## [1] "integer"
x <- list(foo = 1:4, bar = 0:6, baz = "hello")
x[c(1, 3)]
## $foo
## [1] 1 2 3 4
##
## $baz
## [1] "hello"
str(x)
## List of 3
## $ foo: int [1:4] 1 2 3 4
## $ bar: int [1:7] 0 1 2 3 4 5 6
## $ baz: chr "hello"
x[c("foo","baz")]
## $foo
## [1] 1 2 3 4
##
## $baz
## [1] "hello"
x <- list(a = list(10,11,12), b = c(3.14, 2.81))
x[c(1,3)]
## $a
## $a[[1]]
## [1] 10
##
## $a[[2]]
## [1] 11
##
## $a[[3]]
## [1] 12
##
##
## $<NA>
## NULL
class(x[c(1,3)])
## [1] "list"
x[[c(1,3)]]
## [1] 12
class(x[[c(1,3)]])
## [1] "numeric"
str(x)
## List of 2
## $ a:List of 3
## ..$ : num 10
## ..$ : num 11
## ..$ : num 12
## $ b: num [1:2] 3.14 2.81
More examples:
x <- list(foo = 1:4, bar = 0.6, baz = "hello")
name <- "foo"
x[[name]] ## computed index for ‘foo’
## [1] 1 2 3 4
x$name ## element ‘name’ doesn’t exist! NULL
## NULL
x$foo
## [1] 1 2 3 4
x <- c("a", "b", "c", "c", "d", "a")
x[1]
## [1] "a"
class(x[1])
## [1] "character"
x[1:4]
## [1] "a" "b" "c" "c"
x[x>"a"]
## [1] "b" "c" "c" "d"
u <- x > "a"
x[u]
## [1] "b" "c" "c" "d"
#Because data frames are lists of columns, you can use [[ to extract a column from data frames:
mtcars[[1]]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
mtcars[["cyl"]]
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
It’s important to understand the distinction between simplifying and preserving subsetting. Simplifying subsets returns the simplest possible data structure that can represent the output, and is useful interactively because it usually gives you what you want. Preserving subsetting keeps the structure of the output the same as the input, and is generally better for programming because the result will always be the same type. Omitting drop = FALSE when subsetting matrices and data frames is one of the most common sources of programming errors.
| | Simplifying | Preserving |
|---|---|---|
| Vector | x[[1]] | x[1] |
| List | x[[1]] | x[1] |
| Factor | x[1:4, drop = T] | x[1:4] |
| Array | x[1, ] or x[, 1] | x[1, , drop = F] or x[, 1, drop = F] |
| Data frame | x[, 1] or x[[1]] | x[, 1, drop = F] or x[1] |
Preserving is the same for all data types: you get the same type of output as input.
Simplifying behaviour varies slightly between different data types, as described below:
x <- c(a = 1, b = 2)
x[1]
## a
## 1
x[[1]]
## [1] 1
y <- list(a = 1, b = 2)
str(y[1])
## List of 1
## $ a: num 1
str(y[[1]])
## num 1
z <- factor(c("a", "b", "a", "b", "a", "b", "a", "b"))
z[1]
## [1] a
## Levels: a b
z[1, drop = TRUE]
## [1] a
## Levels: a
z[c(1,2), drop = TRUE]
## [1] a b
## Levels: a b
a <- matrix(1:4, nrow = 2)
a[1, , drop = FALSE]
## [,1] [,2]
## [1,] 1 3
class(a[1, , drop = FALSE])
## [1] "matrix"
a[1,]
## [1] 1 3
class(a[1,])
## [1] "integer"
df <- data.frame(a = 1:2, b = 1:2)
str(df[1])
## 'data.frame': 2 obs. of 1 variable:
## $ a: int 1 2
class(df[1])
## [1] "data.frame"
typeof(df[1])
## [1] "list"
str(df[[1]])
## int [1:2] 1 2
class(df[[1]])
## [1] "integer"
str(df[, "a", drop = FALSE])
## 'data.frame': 2 obs. of 1 variable:
## $ a: int 1 2
str(df[, "a"])
## int [1:2] 1 2
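The drop = FALSE pitfall mentioned earlier can be sketched with a small helper. The names count_rows, count_rows_safe and demo are hypothetical and exist only for this illustration; the point is that a function written with simplifying subsetting works for several columns but quietly changes behaviour for a single column.

```r
# Hypothetical helpers, for illustration only.
count_rows <- function(data, cols) {
  nrow(data[, cols])                # simplifies to a vector when one column is selected
}
count_rows_safe <- function(data, cols) {
  nrow(data[, cols, drop = FALSE])  # always returns a data frame
}

demo <- data.frame(a = 1:3, b = 4:6)

two_cols   <- count_rows(demo, c("a", "b"))  # 3: the result is still a data frame
one_col    <- count_rows(demo, "a")          # NULL: nrow() of a plain vector
one_col_ok <- count_rows_safe(demo, "a")     # 3: structure preserved
```

Using drop = FALSE inside functions keeps the return type independent of how many columns the caller happens to select.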
$ is a shorthand operator, where x$y is equivalent to x[["y", exact = FALSE]].
One common mistake with $ is to try to use it when you have the name of a column stored in a variable:
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
var <- "cyl"
mtcars$var
## NULL
mtcars[[var]]
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars$cyl
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
There’s one important difference between $ and [[: $ does partial matching, while [[ does not by default.
x <- list(abc = 1, b=2)
x$a
## [1] 1
x[["a"]]
## NULL
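As a minimal sketch of the equivalence stated earlier, the exact argument of [[ can be switched off to reproduce what $ does. The variable names partial and strict are only illustrative.

```r
x <- list(abc = 1, b = 2)

partial <- x[["a", exact = FALSE]]  # partial matching allowed, like x$a
strict  <- x[["a"]]                 # exact matching is the default for [[

partial  # 1
strict   # NULL
```

Relying on the default exact matching of [[ is usually the safer choice in programs, because a misspelled or abbreviated name then returns NULL instead of silently matching another component.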
The following table summarises the results of subsetting atomic vectors and lists with [ and [[ and different types of out-of-bounds (OOB) value.

| Operator | Index | Atomic | List |
|---|---|---|---|
| [ | OOB | NA | list(NULL) |
| [ | NA_real_ | NA | list(NULL) |
| [ | NULL | x[0] | list() |
| [[ | OOB | Error | Error |
| [[ | NA_real_ | Error | NULL |
| [[ | NULL | Error | Error |
If the input vector is named, then the names of OOB, missing, or NULL components will be "<NA>".
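A few of these cases can be checked directly. This is only a sketch: it uses unnamed inputs, covers integer out-of-bounds indices and NA_real_ rather than every row of the table, and wraps the failing calls in tryCatch so that the script keeps running. All variable names are illustrative.

```r
x <- c(1, 2)      # atomic vector
l <- list(1, 2)   # list

oob_atomic <- x[5]           # NA: [ pads out-of-bounds positions
oob_list   <- l[5]           # a list of length one containing NULL
na_list    <- l[[NA_real_]]  # NULL

# [[ with an out-of-bounds integer index signals an error for both types
err_atomic <- tryCatch(x[[5]], error = function(e) "error")
err_list   <- tryCatch(l[[5]], error = function(e) "error")
```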
x <- 1:5
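x[-1] <- 4:1  # replace every element except the first; x is now 1 4 3 2 1
mtcars[] <- lapply(mtcars, as.integer)  # empty subset: mtcars stays a data frame
mtcars <- lapply(mtcars, as.integer)  # no subset: mtcars becomes a plain list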
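The difference between the two assignments can be seen on a small stand-in data frame, which avoids overwriting mtcars. The names df, kept and lost are hypothetical.

```r
df <- data.frame(a = c(1.5, 2.5), b = c(3.5, 4.5))

kept <- df
kept[] <- lapply(kept, as.integer)  # contents replaced, data frame structure kept

lost <- lapply(df, as.integer)      # plain assignment: the result is a list

class(kept)  # "data.frame"
class(lost)  # "list"
```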
With lists, you can use subsetting + assignment + NULL to remove components from a list. To add a literal NULL to a list, use [ and list(NULL):
x <- list(a = 1, b = 2)
x[["b"]] <- NULL
str(x)
## List of 1
## $ a: num 1
x <- list(a = 1, b = 2)
x["b"] <- list(NULL)
str(x)
## List of 2
## $ a: num 1
## $ b: NULL
y <- list(a = 1)
y["b"] <- list(NULL)
str(y)
## List of 2
## $ a: num 1
## $ b: NULL
x <- c(1, 2, NA, 4, NA, 5)
bad <- is.na(x)
x = x[!bad]
dim(airquality)
## [1] 153 6
airquality[1:6, ]
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
good <- complete.cases(airquality)
airquality[good, ][1:6, ]
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
* lapply: Loop over a list and evaluate a function on each element
* sapply: Same as lapply but try to simplify the result
* apply: Apply a function over the margins of an array
* tapply: Apply a function over subsets of a vector
* mapply: Multivariate version of lapply

An auxiliary function split is also useful, particularly in conjunction with lapply.
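The functions in this list that are not demonstrated further on (tapply, mapply and split) can be sketched briefly. The data and variable names below are made up for the illustration.

```r
values <- c(10, 20, 30, 40, 50, 60)
groups <- c("a", "b", "a", "b", "a", "b")

# tapply: apply mean to the subset of values in each group
group_means <- tapply(values, groups, mean)       # a = 30, b = 40

# mapply: call rep(1, 3), rep(2, 2) and rep(3, 1) in parallel
repeated <- mapply(rep, 1:3, 3:1)

# split + lapply: break values into groups, then sum each piece
group_sums <- lapply(split(values, groups), sum)  # a = 90, b = 120
```

Because the three calls to rep return vectors of different lengths, mapply cannot simplify the result and returns a list.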
lapply takes three arguments: (1) a list X; (2) a function (or the name of a function) FUN; (3) other arguments, passed on to FUN via the ... argument. If X is not a list, it will be coerced to a list using as.list.
lapply always returns a list, regardless of the class of the input.
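The third argument and the coercion rule can be illustrated as follows. This is a small sketch with invented data; na.rm is an argument of mean that lapply forwards through its ... argument.

```r
with_na <- list(a = c(1, NA, 3), b = c(4, 5, NA))

# na.rm = TRUE is not used by lapply itself; it is handed on to mean()
means <- lapply(with_na, mean, na.rm = TRUE)  # a = 2, b = 4.5

# 1:3 is an integer vector, so it is coerced with as.list() first
squares <- lapply(1:3, function(i) i^2)       # still returns a list
```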
x <- list(a = 1:5, b = rnorm(10))
lapply(x, mean)
## $a
## [1] 3
##
## $b
## [1] 0.3864866
x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
lapply(x, function(elt) elt[,1])
## $a
## [1] 1 2
##
## $b
## [1] 1 2 3
sapply will try to simplify the result of lapply if possible.
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean)
## $a
## [1] 2.5
##
## $b
## [1] -0.2139658
##
## $c
## [1] 0.6853578
##
## $d
## [1] 4.839964
sapply(x, mean)
## a b c d
## 2.5000000 -0.2139658 0.6853578 4.8399644
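What sapply returns depends on the shape of the individual results. A minimal sketch with illustrative variable names:

```r
# Every result has length 1: simplified to a vector
as_vector <- sapply(1:3, function(i) i^2)

# Every result has the same length (> 1): simplified to a matrix, one column per element
as_matrix <- sapply(1:3, function(i) c(i, i^2))

# Results have different lengths: no simplification, a list is returned
as_list <- sapply(1:3, function(i) seq_len(i))
```

Because the return type can change with the data, code that needs a predictable type often uses lapply, or vapply with an explicit template, instead.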
apply evaluates a function over the margins of an array:
* It is most often used to apply a function to the rows or columns of a matrix.
* It can be used with general arrays, e.g. taking the average of an array of matrices.
* It is not really faster than writing a loop, but it works in one line!
str(apply)
## function (X, MARGIN, FUN, ...)
* X is an array.
* MARGIN is an integer vector indicating which margins should be “retained”.
* FUN is a function to be applied.
* ... is for other arguments to be passed to FUN.
x <- matrix(1:9, 3, 3)
apply(x, 2, mean)
## [1] 2 5 8
apply(x, 1, mean)
## [1] 4 5 6
# MARGIN = 1 applies FUN over rows; MARGIN = 2 applies FUN over columns
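Two further uses of apply mentioned above, passing extra arguments to FUN and working on a general array, can be sketched like this. The object names row_q, a and avg are illustrative only.

```r
x <- matrix(1:9, 3, 3)

# Extra arguments (probs) are passed through ... to quantile();
# each row of x yields two values, so the result is a 2 x 3 matrix
row_q <- apply(x, 1, quantile, probs = c(0.25, 0.75))

# For plain means, rowMeans() and colMeans() give the same numbers as apply()
same_rows <- all(rowMeans(x) == apply(x, 1, mean))
same_cols <- all(colMeans(x) == apply(x, 2, mean))

# An array of two 2 x 2 matrices: keep margins 1 and 2, average over the third
a   <- array(1:8, c(2, 2, 2))
avg <- apply(a, c(1, 2), mean)  # element-wise mean of the two matrices
```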