Statistical Modeling: Introduction to R

Rasim Muzaffer Musal

Overview I

  • Learning a scripting language for statistical analysis is important for reproducible and documentable, workflows.

  • In this set of lectures we will start learning basics behind R, the programming language we will be working with in this course.

  • R is written with C, Fortran and R itself.

    •   R contains a lot of overhead, one way to write faster running code is to be able to write code that is sent to C with minimal R overhead.   
  • It is a high level declarative (mainly) language.

    •   The order you write a line of code matters.

Overview II

  • Before we dive into statistical concepts we need to clarify a set of concepts.

  • Classes, Objects, Methods, functions, data structures and types, for loops, vectorization and broadcasting are the main topics we will discuss.

  • You will not have to memorize any of these concepts for multiple choice tests!

What are classes and objects

  • Classes are descriptions of how data is organized and what methods are available to call from it.

  • Examples of classes are dataframe and matrix which are a rectangular form that can hold different types of data.

  • Objects are instances of classes.

object=data.frame(F.Name="",L.Name="")
  • Another way:
object=as.data.frame(matrix(nrow=5,ncol=2))
names(object)=c("F.Name","L.Name")

What are functions and methods?

  • Methods are a set of instructions on objects.
print(object)
  F.Name L.Name
1     NA     NA
2     NA     NA
3     NA     NA
4     NA     NA
5     NA     NA
  • Functions are also a set of instructions but are not necessarily associated with objects.

  • Below, the rnorm function generates 5 normally distributed with mean (\(\mu\)) 0 and standard deviation (\(\sigma\)) 1.

rnorm(n=5)
[1] -0.1936169  0.4596331  0.1286988 -0.9054287 -0.1109784

What are functions and methods?

  • In R many of the functions have default parameters such as the rnorm has mean and standard deviation as 0 and 1.

  • To change it, we need to specify new parameter values within the parenthesis of the function.

rnorm(n=5)
[1] -0.9791609  1.5026285 -0.1769093  2.7588748  2.0323952
rnorm(n=5,mean=10,sd=3)
[1]  7.188895 12.729607  7.202865  8.825103  8.052225

Data types in R? Atomic and Recursive Classes

  • Data in R can be either atomic, containing only one type of data limited to character, logical, numeric, integer (vectors, matrix, arrays)

  • or it can be recursive (lists, data frames) that can contain any type of data.

Atomic Classes: Character

  • Character: These are strings. You can do count operations on them.
  • Familiarize yourself with error messages.
x<-c("a","b","c","a","a")
#Counts number of times each value occur. 
table(x)
x
a b c 
3 1 1 
# Look at the error messages
x*2
Error in x * 2: non-numeric argument to binary operator
y=c("0","1","3","4","5")
y+1
Error in y + 1: non-numeric argument to binary operator

Atomic Classes: Numeric/Double

  • Numeric/Double: real numbers, in addition to counting you can do arithmetic operations with.
y=c(0,1,3,4,5)
(y+1)*2
[1]  2  4  8 10 12
is.integer(y)
[1] FALSE
is.double(y)
[1] TRUE

Atomic Classes: Integer

  • Integer: Integers that can be stored with less space than double. Arithmetic operations on objects stored as integers are coerced into double.
is.integer(y)
[1] FALSE
is.double(y)
[1] TRUE

Atomic Classes: Int vs Double

Coerce object y to be an integer and compare sizes.

y_int<-as.integer(y)
is.integer(y_int)
[1] TRUE
is.double(y_int)
[1] FALSE
object.size(y)
96 bytes
object.size(y_int)
80 bytes

How about when we increase the number of values from 5 to 10.

Atomic Classes: Int vs Double

Recall that when we had 5 single digit numbers, the object defined with double was 96 bytes and the object defined with integers is 80 bytes. When we increase the size of these objects to 10 single digit numbers the respective sizes jump to 176 and and 96. This is a good reason why knowing about these differences are important when you are building models with large/big data.

y=c(0,1,2,3,4,5,6,7,8,9)
y_int<-as.integer(y)
object.size(y)
176 bytes
object.size(y_int)
96 bytes

Atomic Classes: Logical

  • Logical: TRUE/FALSE values, usually obtained as a result of comparison p>0.05. These logical values translate to 1 for TRUE and 0 for FALSE.
p<-c(0.01,0.03,0.15,0.02,0.3)
p>0.05
[1] FALSE FALSE  TRUE FALSE  TRUE
logical=p>0.05
object.size(logical)
80 bytes
sum(p>0.05)
[1] 2
  • See above how the vector of logical values summed up to 2. We will use this quite often.

Atomic Classes: Others

  • Atomic classes that we will not use in this class.

  • Complex: A number with an imaginary component. For instance “i”.

https://www.geeksforgeeks.org/imaginary-numbers/

  • Raw: Byte representation of data. We will not be using this in class.

https://stat.ethz.ch/R-manual/R-devel/library/base/html/raw.html

Objects: Data Structures

  • Vectors: Single dimensional collection of data in the same type

  • Lists: Single dimensional collection of data in any type.

  • Arrays: Single or more dimensional data of same type

  • Matrices: Two dimensional arrays.

  • DataFrames: Two dimensional table composed of vectors

  • Factors: Factor is a function that can create a set of fixed number of levels in vector(s) which can be a part of a dataframe. https://www.geeksforgeeks.org/r-factors/ https://r4ds.hadley.nz/factors.html

Objects: Others

  • Regression/Hypothesis: These objects are going to hold information about the statistical procedures. We can use methods to grab this information.
num_rows=100
data_object=as.data.frame(matrix(nrow=num_rows,ncol=2))
data_object[,1]=rnorm(n=num_rows,mean=0,sd=1)
data_object[,2]=data_object[,1]*0.7+rnorm(n=num_rows,mean=0,sd=1)
names(data_object)=c("Y","X")
regression_object=lm(Y~X,data=data_object)
regression_object$coefficients
(Intercept)           X 
  0.1164081   0.4317827 
summary_regression_object=summary(regression_object)
summary_regression_object$r.squared
[1] 0.2675999

What type of a language is R?

  • Imperative vs Declarative (ignoring functional and object oriented)

    • R is mainly a imperative language.
    • Each line has an instruction that has an effect on the program flow.
    • SQL in contrast is a declarative language. You declare what you want, from where you want it.
  • Definitions for language types are not black and white and most languages have a bit of both.

RStudio (Posit) is a GUI

  • Supports multiple languages not only R.

Basic Concepts in Scripting- Control Flows

  • for loops – for(i in 1:END) { Statements to be Repeated Written Here }

– for(i in START:END){ i is the index variable it will take values from START (say 1) to the value END (say 10) represents values that i is going to take, unless otherwise scripted, iterations happen by 1 unit at a time }

  • Other control structures exist such as While

Basic Concepts in Scripting: For Loops

  • A for loop in declares a variable to index. In the example below this is “i”.

  • Everything within brackets is run for the times specified within the parenthesis. In the example below we specify there will be a total of 3 runs.

  • As the statements run, the value “i” changes from 1 to 3.

for(i in 1:3){
  print(i)
}
[1] 1
[1] 2
[1] 3

Basic Concepts in Scripting: For Loops

  • We use loops to repetitively run statements. Below is a hypothetical set of statements.
  • In the example below
  • The data frame ccFraud contains the dependent variable in the very last column, rest are independent variables. 
  • AIC is defined as a single dimension array with the number of elements equivalent to the number of independent variables.
  • We use as independent variables each column separately, run a logistic regression and obtain model comparison measure AIC, store it in the object AIC. 

Basic Concepts in Scripting: For Loops

#AIC object is defined
AIC=array(dim=c(ncol(ccFraud)-1))
END=ncol(ccFraud)-1

#For loop is going to run from i being equal to i to number of columns - 1.
for(i in 1:END){

#Store summary logistic regression information
#The log_regression object is going to be rewritten at every iteration.
log_regression=  summary(glm(Fraud~ccFraud[,i],family=binomial(link='logit'),data=ccFraud))

#AIC is populated at i'th element
AIC[i]=log_regression$aic
}

Basic Concepts in Scripting: Linear Algebra Vectorization

“Vectorization is the process of transforming a scalar operation acting on individual data elements (Single Instruction Single Data—SISD) to an operation where a single instruction operates concurrently on multiple data elements (SIMD)” https://www.sciencedirect.com/topics/computer-science/vectorization#

    a=c(1,3,5)
    b=c(6,4,2)

    a+b
[1] 7 7 7

Basic Concepts in Scripting: Vectorization

So what happens in the chunk of code above? A vector is added to another vector (in r, a scalar is a vector)

1 3 5
+ + +
6 4 2
= = =
7 7 7

Basic Concepts in Scripting: Vectorization

  • How do we calculate residuals in regression (Observed - Predicted)

  • This is a vectorized operation, operations are made in parallel.

    y=c(10,20,30,50)
    yhat=c(8,22,28,52)
    y-yhat
[1]  2 -2  2 -2
  • It would be useful to dig into this topic https://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/

Basic Concepts in Scripting: For Loops vs Vectorization

  • R “For” loops are not efficient but vectorization can not be used for every operation not every function is vectorized.
numcols=10000
numrows=1000
system.time({rho_vec=matrix(data=rnorm(n=numcols*numrows,mean=0,sd=1),nrow=numrows,ncol=numcols)})
   user  system elapsed 
   0.38    0.02    0.39 
rho_for=matrix(nrow=numrows,ncol=numcols)
system.time(for(i in 1:numcols){rho_for[,i]=rnorm(n=numrows,mean=0,sd=1)}
)
   user  system elapsed 
   0.33    0.04    0.38 

Basic Concepts in Scripting: For Loops vs Vectorization

  • However vectorized functions, can provide exponential efficiencies.
means_vec=array(dim=c(numcols))
means_for=array(dim=c(numcols))
system.time({
means_vec=colMeans(rho_vec)
})
   user  system elapsed 
      0       0       0 
system.time({
  for(i in 1:numcols){
means_for[i]=rho_for[,i]}
})
   user  system elapsed 
   1.34    0.30    1.64 

Basic Concepts in Scripting: Recycling/Broadcasting

  • When you operate on two objects, matching the dimension of one object to the other for the operations is called broadcasting. Recycling works in a similar fashion however rather than matching the dimensions of two objects, the smaller object is copied until it has the same dimension as the larger object.

  • Base R uses recycling https://stackoverflow.com/questions/42893238/recycling-higher-dimensional-arrays

Basic Concepts in Scripting: Recycling

  • When you add a vector to a scalar, the scalar is recycled to match the dimensions of the vector.
    a=c(1,3,5)
    a+1
[1] 2 4 6
1 3 5
+ + +
1 1 1
= = =
2 4 6

Basic Concepts in Scripting: Recycling

Adding two vectors of sizes that do not conform to each other. Objects a and b are vectors.

    a=c(1,3,5)
    b=c(5,7)
    a+b
Warning in a + b: longer object length is not a multiple of shorter object
length
[1]  6 10 10
1 3 5
+ + +
5 7 5
= = =
6 10 10

Basic Concepts in Scripting: Recycling

How does this work with multiplication?

    b=c(5,7)
    X=c(5,10,2,-5,3)
    b*X
Warning in b * X: longer object length is not a multiple of shorter object
length
[1]  25  70  10 -35  15
5 7 5 7 5
\(\times\) \(\times\) \(\times\) \(\times\) \(\times\)
5 10 2 -5 3
= = = = =
25 70 10 -35 15

Arrays of Higher Dimensions

Create 2, 3 dimensional arrays with each dimension having 2,3 and 2 as sizes of these dimensions. We do not specify any data in the arrays therefore it will have NA values.

a_1=array(dim=c(2,3,2))
a_2=array(dim=c(2,3,2))
dim(a_1)
[1] 2 3 2
length(dim(a_1))
[1] 3

3-d Arrays

Print a_1 before we populate it.

a_1
, , 1

     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA

, , 2

     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA

Nested Loops

We will populate the arrays arbitrarily with a nested for loop structure.

for(i in 1:2){for(j in 1:3){for(k in 1:2){
  a_1[i,j,k]=i+j+k
  a_2[i,j,k]=i+j+k-1
  }
}
  }

What are the values populating the arrays?

Populated 3-d Arrays

The object a_1 has the following elements when k=1 and k=2:

a_1
, , 1

     [,1] [,2] [,3]
[1,]    3    4    5
[2,]    4    5    6

, , 2

     [,1] [,2] [,3]
[1,]    4    5    6
[2,]    5    6    7

Recycling for Higher Dimensions

Add to the array, vector d.

d=c(1,5,10)
z_1=a_1+d
is.array(z_1)
[1] TRUE

The vector d is recycled by filling columns to conform to the dimensions of a_1 the following structure for k=1 and k=2:

1 10 5
5 1 10

Resulting in:

3-d array + Vector Result:

The result is:

z_1
, , 1

     [,1] [,2] [,3]
[1,]    4   14   10
[2,]    9    6   16

, , 2

     [,1] [,2] [,3]
[1,]    5   15   11
[2,]   10    7   17

Adding Higher Dimensional Arrays

  • We can add arrays of conforming dimensions
a_1+a_2
, , 1

     [,1] [,2] [,3]
[1,]    5    7    9
[2,]    7    9   11

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    9   11   13

Error in Recycling

  • What happens when we add two arrays which do not have the same dimensions.

  • If we convert object d to an array, it does not get recycled from a dimension of 1 to dimensions of 3 and you get an error.

d=as.array(d)
length(dim(d))
[1] 1
length(dim(a_1))
[1] 3
a_1+d
Error in a_1 + d: non-conformable arrays

Unlike vectors, smaller arrays are not recycled to conform to the dimensions of the larger object.

Coercion in R

  • Coercion is the mutation of the type or structure of a data/object due to operations done on them.

  • We actually have already seen an example of coercion.

p=c(0.7,0.2,0.6,0.8,0.9)
Pred_Success=p>0.5
Pred_Success
[1]  TRUE FALSE  TRUE  TRUE  TRUE
sum(Pred_Success)
[1] 4
  • In the example above, in order for sum function to work the Logical types have been coerced into integers of 1 (TRUE) or 0 (FALSE).

https://www2.stat.duke.edu/courses/Fall20/sta523/slides/lecture/lec_01.html#9

Conclusion

  • After going over these basic concepts we are ready to learn how to simulate coin tosses from probability distributions, revisit Central Limit Theorem (my favorite after Evolution), relearn Hypothesis Testing, type 1 and type 2 errors.

References

  • https://www.geeksforgeeks.org/classes-in-r-programming/

  • https://rpubs.com/Thinklabz/data_types_and_objects

  • https://caml.inria.fr/pub/docs/oreilly-book/html/book-ora140.html