Miscellaneous Functions

Topics

Looping functions

  • Looping functions let us apply a function over the elements of an array, matrix, or data frame.
  • These functions are apply, tapply, lapply, and sapply.
  • Let us create a matrix to illustrate these functions.
mat=matrix(1:12,nrow=3)
dimnames(mat)=list(LETTERS[1:3],LETTERS[4:7])
mat
  D E F  G
A 1 4 7 10
B 2 5 8 11
C 3 6 9 12

apply function

  • apply function takes 3 arguments
  • (data structure, dimension, function to compute)
  • dimension: 1=rowwise, 2=columnwise
a1=apply(mat,1,mean)
a1
  A   B   C 
5.5 6.5 7.5 
a2=apply(mat,2,mean)
a2
 D  E  F  G 
 2  5  8 11 
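
The function passed to apply does not have to be a built-in one. The sketch below rebuilds the matrix from the slide, applies an anonymous function row-wise, and checks that rowMeans() and colMeans() give the same means as the apply calls above. The names row_spread, same_rows, and same_cols are illustrative only.

```r
# Rebuild the matrix from the slide so this block runs on its own
mat <- matrix(1:12, nrow = 3)
dimnames(mat) <- list(LETTERS[1:3], LETTERS[4:7])

# Row-wise spread (max minus min) using an anonymous function
row_spread <- apply(mat, 1, function(r) max(r) - min(r))
row_spread

# For plain means, rowMeans() and colMeans() return the same values as apply()
same_rows <- all.equal(apply(mat, 1, mean), rowMeans(mat))
same_cols <- all.equal(apply(mat, 2, mean), colMeans(mat))
```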

tapply (tabular apply) function

  • tapply function takes 3 arguments
  • (data vector, factor, function to compute)
x=1:10
f=c(rep(1,3),rep(2,5),rep(3,2))
f
 [1] 1 1 1 2 2 2 2 2 3 3
tapply(x,f,mean)
  1   2   3 
2.0 6.0 9.5 
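
The grouping vector can also be a labelled factor, which makes the output easier to read. A minimal sketch, assuming made-up scores and group labels (score and group are hypothetical data, not from the slides):

```r
# Hypothetical data: six scores, each belonging to one of two groups
score <- c(70, 85, 90, 60, 75, 80)
group <- factor(c("ctrl", "trt", "trt", "ctrl", "trt", "ctrl"))

# Mean score per group; the result is named by the factor levels
group_mean <- tapply(score, group, mean)
group_mean

# Any function of a vector can be used, e.g. the group sizes
group_n <- tapply(score, group, length)
group_n
```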

lapply (list apply) function

  • lapply function takes 2 arguments
  • (list, function to compute)
  • the output from lapply is also a list
x=list(a=1:5,b=c(4,6,5))
lapply(x,mean)
$a
[1] 3

$b
[1] 5
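
lapply also accepts anonymous functions, and any further arguments after the function are passed on to it. A short sketch using the same list as above (the names sq and med are illustrative):

```r
x <- list(a = 1:5, b = c(4, 6, 5))

# Anonymous function: square every element of each list component
sq <- lapply(x, function(v) v^2)
sq

# Extra arguments (here probs) are forwarded to quantile()
med <- lapply(x, quantile, probs = 0.5)
med
```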

lapply (list apply) function: reading multiple files

  • Suppose we want to read multiple files from a directory
setwd("C:/rpractice/shiny/AOL")
files= list.files(pattern="\\.csv$") # pattern is a regular expression, not a wildcard
files # show filenames in the working directory
[1] "AOLCourse.csv" "results1.csv" 
rdf=lapply(files, read.csv) # files are loaded into rdf
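
The same idea can be tried without the directory above. The sketch below is self-contained: it writes two small, made-up CSV files (part1.csv and part2.csv, hypothetical names) to a temporary directory, reads them with lapply, and stacks them. Stacking with rbind assumes all files share the same columns.

```r
# Write two small hypothetical CSV files into a temporary directory
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, score = c(10, 20)),
          file.path(demo_dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, score = c(30, 40)),
          file.path(demo_dir, "part2.csv"), row.names = FALSE)

# full.names = TRUE returns complete paths, so setwd() is not needed
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file; naming the list keeps track of where each data frame came from
rdf <- lapply(files, read.csv)
names(rdf) <- basename(files)

# Stack the data frames into one (assumes identical columns)
combined <- do.call(rbind, rdf)
combined
```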

sapply (simplified list apply) function

  • Same as lapply, except that the output is simplified to a vector or matrix where possible
x=list(a=1:5,b=c(4,6,5))
sapply(x,mean)
a b 
3 5 
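
When the function returns more than one value per element, sapply simplifies the result to a matrix instead of a vector. The sketch below shows this with range(), and also vapply(), a base R alternative that checks the type and length of each result. The names r and m are illustrative.

```r
x <- list(a = 1:5, b = c(4, 6, 5))

# range() returns two values, so the result is a matrix with one column per element
r <- sapply(x, range)
r

# vapply() works like sapply() but requires a declared result template
m <- vapply(x, mean, numeric(1))
m
```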

Factor Variables

  • Factors are used to represent categorical data.
  • Factors can be unordered or ordered.
  • Using factors with labels is better than using integers because factors are self-describing.
  • For example, a variable with values “Male” and “Female” is easier to interpret than one with values 1 and 2.
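
Integer codes can be turned into a self-describing factor with the levels and labels arguments of factor(). A minimal sketch, assuming the coding 1 = Male and 2 = Female (the vector codes is made-up data):

```r
# Hypothetical integer-coded variable: 1 = Male, 2 = Female (assumed coding)
codes <- c(1, 2, 2, 1, 2)

# Attach labels so the values describe themselves
sex <- factor(codes, levels = c(1, 2), labels = c("Male", "Female"))
sex
table(sex)
```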

Factor examples

x = factor(c("yes", "yes", "no", "yes", "no"))
x
[1] yes yes no  yes no 
Levels: no yes
table(x)
x
 no yes 
  2   3 
unclass(x)
[1] 2 2 1 2 1
attr(,"levels")
[1] "no"  "yes"

Factor order

  • Order of the levels can be set using the levels argument to factor()
  • This can be important in linear modelling because the first level is used as the baseline level.
x = factor(c("yes", "yes", "no", "yes"), levels = c("yes", "no"))
x; unclass(x)
[1] yes yes no  yes
Levels: yes no
[1] 1 1 2 1
attr(,"levels")
[1] "yes" "no" 
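
Two related tools, sketched here with made-up data: relevel() changes the baseline level of an existing factor, and ordered = TRUE creates an ordered factor whose levels can be compared. The variables size and x2 are illustrative.

```r
# Change the baseline (first) level of an existing factor
x <- factor(c("yes", "yes", "no", "yes"))   # default levels: no, yes
x2 <- relevel(x, ref = "yes")
levels(x2)

# Ordered factor: the levels have a meaningful order and can be compared
size <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"), ordered = TRUE)
below_high <- size < "high"
below_high
```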

Missing Values

Missing values are denoted by NA. NaN (Not a Number) denotes the result of an undefined mathematical operation.

  • is.na() tests whether the elements of an object are NA
  • is.nan() tests whether the elements of an object are NaN
  • NA values have a class also, so there are integer NA, character NA, etc.
  • A NaN value is also NA, but the converse is not true: is.na(NaN) is TRUE, while is.nan(NA) is FALSE

x = c(1, 2, NA, 10, 3)
is.na(x);is.nan(x)
[1] FALSE FALSE  TRUE FALSE FALSE
[1] FALSE FALSE FALSE FALSE FALSE
x = c(1, 2, NaN, NA, 4)
is.na(x);is.nan(x)
[1] FALSE FALSE  TRUE  TRUE FALSE
[1] FALSE FALSE  TRUE FALSE FALSE
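
The points about NA classes and NaN can be checked directly. The sketch below uses the typed NA constants that base R provides and produces a NaN with 0/0 (the variable names are illustrative):

```r
# NA has a class; typed constants exist for each atomic type
na_classes <- c(class(NA), class(NA_integer_), class(NA_real_), class(NA_character_))
na_classes

# An undefined operation such as 0/0 produces NaN
nan_value <- 0 / 0

# NaN counts as NA, but NA does not count as NaN
nan_is_na <- is.na(nan_value)
na_is_nan <- is.nan(NA)
```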

Removing Missing Values

A common task is to remove missing values (NAs).

x = c(1, 2, NA, 4, NA, 5)
bad = is.na(x)
x[!bad]
[1] 1 2 4 5
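
Besides logical indexing, base R offers a few shortcuts. A brief sketch: many summary functions accept na.rm = TRUE, and complete.cases() keeps only the rows of a data frame that have no missing values. The data frame df is made-up data.

```r
x <- c(1, 2, NA, 4, NA, 5)

# Without na.rm the result is NA; with na.rm = TRUE the NAs are skipped
mean_with_na <- mean(x)
mean_no_na <- mean(x, na.rm = TRUE)

# Hypothetical data frame with missing values in both columns
df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))

# Keep only rows in which every column is observed
clean <- df[complete.cases(df), ]
clean
```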