Learning a scripting language for statistical analysis is important for reproducible and documentable workflows.
In this set of lectures we will start learning the basics of R, the programming language we will be working with in this course.
R is written in C, Fortran and R itself.
R carries a lot of overhead; one way to write faster code is to write it so that the work is passed to C with minimal R overhead.
It is a high-level, mainly declarative language.
The order in which you write lines of code matters.
Before we dive into statistical concepts we need to clarify a set of programming concepts.
Classes, objects, methods, functions, data structures and types, for loops, vectorization and broadcasting are the main topics we will discuss.
You will not have to memorize any of these concepts for multiple choice tests!
Classes are descriptions of how data is organized and which methods can be called on it.
Examples of classes are data.frame and matrix, both of which are rectangular. A data frame can hold a different type of data in each column, while a matrix holds a single type.
Objects are instances of classes.
Methods are sets of instructions associated with a class. Functions are also sets of instructions but are not necessarily associated with objects.
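As a minimal sketch of these ideas, the snippet below creates two objects of different classes and calls the same generic function on each. The object names and the helper `add_one` are hypothetical and only serve as an illustration.

```r
# A matrix and a data frame are both rectangular, but they are different classes.
m <- matrix(1:6, nrow = 2)                           # an object of class "matrix"
df <- data.frame(id = 1:3, name = c("a", "b", "c"))  # an object of class "data.frame"

class(m)    # in R >= 4.0 this reports both "matrix" and "array"
class(df)

# summary() is a generic function: it dispatches to a different method
# depending on the class of the object it receives.
summary(m)
summary(df)

# A plain function is not tied to any particular class.
add_one <- function(x) x + 1   # hypothetical helper, for illustration only
add_one(m)
```

Calling `methods(class = "data.frame")` lists the methods that are available for data frames.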
Below, the rnorm function generates 5 normally distributed random values with mean (\(\mu\)) 0 and standard deviation (\(\sigma\)) 1.
In R many functions have default parameter values; for example, rnorm uses 0 and 1 as the default mean and standard deviation.
To change them, we need to specify new parameter values within the parentheses of the function.
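A short sketch of the default and overridden parameters described above. The seed and the values 10 and 2 are arbitrary choices for illustration.

```r
set.seed(1)   # arbitrary seed, only so the draws are repeatable

# Five draws using the defaults: mean = 0 and sd = 1
x_default <- rnorm(n = 5)

# Five draws with the defaults overridden inside the parentheses
x_custom <- rnorm(n = 5, mean = 10, sd = 2)

x_default
x_custom
```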
Data in R can be either atomic, containing only one type of data such as character, logical, numeric (double) or integer (vectors, matrices, arrays),
or recursive (lists, data frames), which can contain any type of data.
Coerce object y to be an integer and compare sizes.
[1] TRUE
[1] FALSE
96 bytes
80 bytes
How about when we increase the number of values from 5 to 10?
Recall that with 5 single-digit numbers, the object stored as double was 96 bytes and the object stored as integer was 80 bytes. When we increase these objects to 10 single-digit numbers, the respective sizes jump to 176 and 96 bytes. This is a good reason why knowing about these differences is important when you are building models with large data.
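The size comparison described above can be sketched as follows. The values and object names are assumptions, since any five and ten single-digit numbers would do; the exact byte counts printed by `object.size` may depend on the R build.

```r
x <- c(1, 2, 3, 4, 5)     # hypothetical values; stored as double by default
y <- as.integer(x)        # coerce to integer

is.double(x)              # TRUE
is.integer(x)             # FALSE
is.integer(y)             # TRUE

object.size(x)            # doubles use 8 bytes per value
object.size(y)            # integers use 4 bytes per value

# Repeat with 10 values
x10 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1)
y10 <- as.integer(x10)
object.size(x10)
object.size(y10)
```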
[1] FALSE FALSE TRUE FALSE TRUE
80 bytes
[1] 2
Atomic classes that we will not use in this class.
Complex: A number with an imaginary component. For instance “i”.
https://www.geeksforgeeks.org/imaginary-numbers/
Raw: Holds raw bytes. https://stat.ethz.ch/R-manual/R-devel/library/base/html/raw.html
Vectors: One-dimensional collection of data of the same type.
Lists: One-dimensional collection of data of any type.
Arrays: Data of the same type arranged in one or more dimensions.
Matrices: Two-dimensional arrays.
DataFrames: Two-dimensional table composed of equal-length vectors (the columns), which may differ in type.
Factors: factor is a function that creates a vector whose values are restricted to a fixed set of levels; such a vector can be a column of a data frame. https://www.geeksforgeeks.org/r-factors/ https://r4ds.hadley.nz/factors.html
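A minimal sketch creating one object of each structure listed above. All object names and values are made up for illustration.

```r
v <- c(1.5, 2.5, 3.5)                         # vector: one dimension, one type
l <- list(1.5, "a", TRUE)                     # list: one dimension, any types
a <- array(1:24, dim = c(2, 3, 4))            # array: one or more dimensions, one type
m <- matrix(1:6, nrow = 2, ncol = 3)          # matrix: a two-dimensional array
df <- data.frame(x = 1:3,                     # data frame: columns are equal-length vectors
                 g = c("low", "high", "low"))
f <- factor(df$g, levels = c("low", "high"))  # factor: values restricted to fixed levels

str(l)
dim(a)
levels(f)
```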
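# Number of simulated observations
num_rows=100
# Empty two-column data frame (filled with NA)
data_object=as.data.frame(matrix(nrow=num_rows,ncol=2))
# Column 1: standard normal draws
data_object[,1]=rnorm(n=num_rows,mean=0,sd=1)
# Column 2: 0.7 times column 1 plus standard normal noise
data_object[,2]=data_object[,1]*0.7+rnorm(n=num_rows,mean=0,sd=1)
names(data_object)=c("Y","X")
# Fit a linear regression of Y on X; the result is an object of class "lm"
regression_object=lm(Y~X,data=data_object)
# Extract the estimated coefficients from the fitted object
regression_object$coefficients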
(Intercept) X
0.1164081 0.4317827
[1] 0.2675999
Imperative vs Declarative (ignoring functional and object oriented)
Definitions for language types are not black and white and most languages have a bit of both.
– for(i in START:END){ ... } Here i is the index variable. It takes each value from START (say 1) to END (say 10) in turn; unless otherwise scripted, it advances by 1 unit per iteration.
A for loop declares an index variable. In the example below this is “i”.
Everything within the curly braces is run once for each value specified within the parentheses. In the example below we specify there will be a total of 3 runs.
As the statements run, the value of “i” changes from 1 to 3.
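A minimal sketch of a loop with 3 runs. The vector `visited` is a hypothetical name, added only to record the values that `i` takes.

```r
visited <- c()                 # will collect each value of i
for (i in 1:3) {
  # This body runs once for each value of i: 1, then 2, then 3
  print(i)
  visited <- c(visited, i)
}
visited
```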
The data frame ccFraud contains the dependent variable in the very last column; the rest are independent variables.
AIC is defined as a one-dimensional array with as many elements as there are independent variables.
We use each column separately as the independent variable, run a logistic regression, obtain the model comparison measure AIC, and store it in the object AIC.
#AIC object is defined, with one element per independent variable
AIC=array(dim=c(ncol(ccFraud)-1))
END=ncol(ccFraud)-1
#For loop is going to run from i being equal to 1 to number of columns - 1.
for(i in 1:END){
  #Store summary logistic regression information
  #The log_regression object is going to be rewritten at every iteration.
  log_regression=summary(glm(Fraud~ccFraud[,i],family=binomial(link='logit'),data=ccFraud))
  #AIC is populated at i'th element
  AIC[i]=log_regression$aic
}
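Since the ccFraud data is not included here, the same pattern can be tried on simulated data. This is only a sketch under stated assumptions: `sim_data`, its columns and the seed are made up, the result is stored in `aic_values` rather than `AIC` so that the built-in `AIC()` function is not masked, and the formula is built with `reformulate()` instead of indexing the data frame inside the formula.

```r
set.seed(123)                          # arbitrary seed for repeatability
n <- 200
sim_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Binary outcome in the last column, driven mostly by x1
sim_data$Fraud <- rbinom(n, size = 1, prob = plogis(0.8 * sim_data$x1))

n_pred <- ncol(sim_data) - 1
aic_values <- numeric(n_pred)          # one slot per independent variable
names(aic_values) <- names(sim_data)[1:n_pred]

for (i in 1:n_pred) {
  # reformulate() builds the formula Fraud ~ <name of column i>
  fit <- glm(reformulate(names(sim_data)[i], response = "Fraud"),
             family = binomial(link = "logit"), data = sim_data)
  aic_values[i] <- AIC(fit)            # lower AIC indicates a better trade-off
}
aic_values
```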
“Vectorization is the process of transforming a scalar operation acting on individual data elements (Single Instruction Single Data—SISD) to an operation where a single instruction operates concurrently on multiple data elements (SIMD)” https://www.sciencedirect.com/topics/computer-science/vectorization#
So what happens in the chunk of code above? A vector is added to another vector, element by element (in R, a scalar is a vector of length one).
1 | 3 | 5 |
---|---|---|
+ | + | + |
6 | 4 | 2 |
= | = | = |
7 | 7 | 7 |
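The element-wise addition in the table can be reproduced with the sketch below. The names `first` and `second` are hypothetical.

```r
first <- c(1, 3, 5)
second <- c(6, 4, 2)

# One expression adds all three pairs: 1+6, 3+4, 5+2
total <- first + second
total
```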
How do we calculate residuals in regression? (Observed - Predicted)
This is a vectorized operation: the subtraction is applied to every element of the vectors at once, without an explicit loop.
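A small sketch of the residual calculation on simulated data. The data, the seed and the object names are assumptions made for illustration.

```r
set.seed(42)                      # arbitrary seed
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)

# Observed minus predicted, computed for all 50 observations in one expression
res_manual <- y - fitted(fit)

# The manual result agrees with the residuals stored by lm()
all.equal(unname(res_manual), unname(residuals(fit)))
```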
numcols=10000
numrows=1000
system.time({rho_vec=matrix(data=rnorm(n=numcols*numrows,mean=0,sd=1),nrow=numrows,ncol=numcols)})
user system elapsed
0.38 0.02 0.39
rho_for=matrix(nrow=numrows,ncol=numcols)
system.time(for(i in 1:numcols){rho_for[,i]=rnorm(n=numrows,mean=0,sd=1)})
user system elapsed
0.33 0.04 0.38
When you operate on two objects, matching the dimensions of one object to those of the other is called broadcasting. Recycling works in a similar fashion; however, rather than matching the dimensions of the two objects, the elements of the shorter object are repeated until it is as long as the larger object.
Base R uses recycling https://stackoverflow.com/questions/42893238/recycling-higher-dimensional-arrays
1 | 3 | 5 |
---|---|---|
+ | + | + |
1 | 1 | 1 |
= | = | = |
2 | 4 | 6 |
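The table corresponds to adding a single number to a vector, as in this sketch (the name `a` is an assumption):

```r
a <- c(1, 3, 5)

# The length-one vector 1 is recycled to 1 1 1 before the addition
shifted <- a + 1
shifted
```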
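Adding two vectors whose lengths do not conform to each other. Objects a and b are vectors; b is shorter, so its elements are recycled, and R warns because the length of a is not a multiple of the length of b.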
Warning in a + b: longer object length is not a multiple of shorter object
length
[1] 6 10 10
1 | 3 | 5 |
---|---|---|
+ | + | + |
5 | 7 | 5 |
= | = | = |
6 | 10 | 10 |
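A sketch that reproduces the output and table above, with the values of `a` and `b` inferred from the table:

```r
a <- c(1, 3, 5)
b <- c(5, 7)

# b is recycled to 5 7 5; R warns because 3 is not a multiple of 2
result <- a + b
result
```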
How does this work with multiplication?
Warning in b * X: longer object length is not a multiple of shorter object
length
[1] 25 70 10 -35 15
5 | 7 | 5 | 7 | 5 |
---|---|---|---|---|
\(\times\) | \(\times\) | \(\times\) | \(\times\) | \(\times\) |
5 | 10 | 2 | -5 | 3 |
= | = | = | = | = |
25 | 70 | 10 | -35 | 15 |
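The multiplication follows the same recycling rule. In this sketch the values of `b` and `X` are inferred from the table above.

```r
b <- c(5, 7)
X <- c(5, 10, 2, -5, 3)

# b is recycled to 5 7 5 7 5; R warns because 5 is not a multiple of 2
product <- b * X
product
```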
Create two 3-dimensional arrays, each with dimensions of sizes 2, 3 and 2. We do not specify any data for the arrays, so they will be filled with NA values.
Print a_1 before we populate it.
We will populate the arrays arbitrarily with a nested for loop structure.
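A minimal sketch of creating and populating one such array. The fill rule `i + j + k` is a hypothetical choice made for illustration; any arbitrary rule would work the same way.

```r
a_1 <- array(dim = c(2, 3, 2))   # no data supplied, so every cell is NA
a_1

# Hypothetical fill rule: each cell gets the sum of its three indices
for (i in 1:2) {          # rows
  for (j in 1:3) {        # columns
    for (k in 1:2) {      # third dimension
      a_1[i, j, k] <- i + j + k
    }
  }
}

a_1[, , 1]   # the 2 x 3 slice for k = 1
a_1[, , 2]   # the 2 x 3 slice for k = 2
```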
What are the values populating the arrays?
The object a_1 has the following elements when k=1 and k=2:
Add the vector d to the array.
The vector d is recycled by filling columns to conform to the dimensions of a_1, giving the following structure for both k=1 and k=2:
1 | 10 | 5 |
5 | 1 | 10 |
Resulting in:
The result is:
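The recycling pattern can be checked with a sketch in which the array holds only zeros, so that the sum shows exactly how d was laid out. The values of `d` are inferred from the table above; the zero-filled array is an assumption made for clarity.

```r
a_zero <- array(0, dim = c(2, 3, 2))   # zeros make the recycled pattern visible
d <- c(1, 5, 10)

# d is recycled down the columns, slice by slice, until all 12 cells are covered
result <- a_zero + d
result[, , 1]
result[, , 2]
```

No warning is raised here because the array has 12 elements, which is a multiple of the length of d.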
What happens when we add two arrays which do not have the same dimensions?
If we convert object d to an array, it has 1 dimension while a_1 has 3. It is not recycled to match, and you get an error.
[1] 1
[1] 3
Error in a_1 + d: non-conformable arrays
Unlike vectors, smaller arrays are not recycled to conform to the dimensions of the larger object.
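A sketch of the failing addition. The error is caught with `tryCatch()` so the script keeps running; the names `a_zero` and `d_arr` are hypothetical.

```r
a_zero <- array(0, dim = c(2, 3, 2))
d_arr <- array(c(1, 5, 10))      # a one-dimensional array of length 3

length(dim(d_arr))               # 1
length(dim(a_zero))              # 3

# Both operands carry a dim attribute and the dims differ, so R stops with an error.
# tryCatch() captures the error message instead of halting.
msg <- tryCatch(a_zero + d_arr, error = function(e) conditionMessage(e))
msg
```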
Coercion is the change in the type or structure of an object caused by an operation performed on it.
We actually have already seen an example of coercion.
[1] TRUE FALSE TRUE TRUE TRUE
[1] 4
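A sketch of the coercion behind the output above: summing a logical vector treats TRUE as 1 and FALSE as 0. The name `flags` is hypothetical, and the last line is an extra illustration of coercion inside `c()`.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE, TRUE)
total_true <- sum(flags)   # logical values are coerced to 1 and 0 before summing
total_true                 # 4

typeof(flags)              # "logical"
typeof(total_true)         # "integer"

# Mixing types in c() coerces every element to the most flexible type, here character
mixed <- c(1, "a", TRUE)
mixed
```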
https://www2.stat.duke.edu/courses/Fall20/sta523/slides/lecture/lec_01.html#9
https://www.geeksforgeeks.org/classes-in-r-programming/
https://rpubs.com/Thinklabz/data_types_and_objects
https://caml.inria.fr/pub/docs/oreilly-book/html/book-ora140.html