Data Science - Coursera

Notas para la futura Alejandra. Recuerda SIEMPRE estos pasos: Guardar->Commit->Push (asi se hayan desaparecido los archivos) y ya. Puedes usar esta pagina donde estan las referencias de R Markdown. Se puede usar Esta pagina para cosas utiles.

Markdown Sheets
Markdown Info
R Cheat Sheets

Para borrar la consola cntrl+L y para borrar todas las variables rm(list=ls()). Para tener informacion de una funcion se pone “?funcion”

There are different types of classes of objetcs. R has five basic or atomic classes of objects:

R objects can have attributes which are like metadata. Attributes can be accessed using attributes()

Para hacer Debugging:


Generalities

Vectores, Matrices, listas, factores, dataframe…

Intro

  • Functions
    • c()
    • vector()
    • names()
    • class()
    • dimnames()
    • list()
    • factors()
    • levels()
    • is.na()/ is.nan()

Vectors

Vectors can have only one class of objects, when different objects are mixed in a vector, coercion occurs so that every element in the vector is of the same class. To create a vector:

  • Con la funcion c()
x<- c(0.5, 0.6) #Como si c significara concatenar
print(x)
> [1] 0.5 0.6
  • Tambien se le pueden poner nombres a las columnas
names(x)<-c("Food","Bar")
x
> Food  Bar 
>  0.5  0.6

Pueden haber distintos objetos (numericos, logicos, enteros, etc)

  • Con la funcion vector()
x<-vector("numeric", length=10)
x
>  [1] 0 0 0 0 0 0 0 0 0 0

Also, different modes can be used such as “logical”, “integer”, “numeric” (synonym “double”), “complex”, “character” and “raw”.
*** Para saber la clase entonces se usa class()

class(x)
> [1] "numeric"
  • Se puede forzar la conversion entre clases de numeros. Por ejemplo:
x<-0:6
class(x)
> [1] "integer"
as.numeric(x)
> [1] 0 1 2 3 4 5 6
as.character(x)
> [1] "0" "1" "2" "3" "4" "5" "6"
as.logical(x)
> [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
  • Vectors dont have dimentions. However, can be assigned:
my_vector <-1:20  
dim(my_vector)
> NULL
dim(my_vector) <- c(4,5) #rows and cols
my_vector # now is a matrix
>      [,1] [,2] [,3] [,4] [,5]
> [1,]    1    5    9   13   17
> [2,]    2    6   10   14   18
> [3,]    3    7   11   15   19
> [4,]    4    8   12   16   20

Lists

Are a special type of vectors that contain elements of different classes. To create one:

  • Using list()
x<-list(1, "a", TRUE, 1+4i) #Recuerda que los numeros que cambian son los indices de los vectores, tienen dobles corchetes cuadrados
print(x)
> [[1]]
> [1] 1
> 
> [[2]]
> [1] "a"
> 
> [[3]]
> [1] TRUE
> 
> [[4]]
> [1] 1+4i
  • Create an empty list with vector()
x <- vector("list", length = 5)

Tambien pueden tener nombres

x<-list(a=1,b= "a",c= TRUE,d= 1+4i)
x
> $a
> [1] 1
> 
> $b
> [1] "a"
> 
> $c
> [1] TRUE
> 
> $d
> [1] 1+4i

Para eliminar una lista se puede usar unlist()

  • To access a list $ is used
x$b
> [1] "a"

Matrices

Matrices are vectores with a dimmension attribute. Se construyen column wise es decir, se llenan las filas de arriba para abajo una columna a la vez.

m<- matrix(1:6, nrow=2, ncol=3) # Se construye column wise
m
>      [,1] [,2] [,3]
> [1,]    1    3    5
> [2,]    2    4    6
dim(m)
> [1] 2 3
attributes(m) #Todo lo que se le pueda sacar
> $dim
> [1] 2 3
m[1,] #Para extraer solo la primera fila
> [1] 1 3 5
  • Pueden tener nombres. Also rownames() or colnames() can be used
dimnames(m) <- list(c("fila1","fila2"),c("col1","col2","col3")) #Se usa una listaaa
m
>       col1 col2 col3
> fila1    1    3    5
> fila2    2    4    6
  • Otra forma es redimensionar un vector
m<-1:10
dim(m) <- c(2,5) #dos filas, dos columnas
m
>      [,1] [,2] [,3] [,4] [,5]
> [1,]    1    3    5    7    9
> [2,]    2    4    6    8   10
  • Cbind-ing and rbind-ing
x<-1:3
y<-10:12
cbind(x,y) # Los une por columnas
>      x  y
> [1,] 1 10
> [2,] 2 11
> [3,] 3 12
print("Tambien se puede unir por filas")
> [1] "Tambien se puede unir por filas"
rbind(x,y)
>   [,1] [,2] [,3]
> x    1    2    3
> y   10   11   12

Factors

One can think of a factor as an integer vector where each integer has a label. Factors are automatically created when read.table() is used. Para tener datos categoricos:

x<-factor(c("yes","no","yes", "no"), levels = c("yes","no")) #Para determinar el orden de los niveles.
x
> [1] yes no  yes no 
> Levels: yes no
table(x) #Conteo
> x
> yes  no 
>   2   2
class(x)
> [1] "factor"

Data Frame

Are used to stored tabular data in R. dplyr is a package optimized for data frames. Are represented as a special type of list where every element of the list has to have the same length. Can store different classes. Generally are read using functions such as read.table() or read.csv()

x<-data.frame(food=1:4, bar=c(T,T,F,F))
x
>   food   bar
> 1    1  TRUE
> 2    2  TRUE
> 3    3 FALSE
> 4    4 FALSE
nrow(x)
> [1] 4
ncol(x)
> [1] 2
  • Para cambiar el nombre de las columnas. names() can be used tu columns
colnames(x)[c(1,2)]<-c("Name","Mortality"); x
>   Name Mortality
> 1    1      TRUE
> 2    2      TRUE
> 3    3     FALSE
> 4    4     FALSE
x<-data.frame(food=1:4, bar=c(T,T,F,F))
names(x) <- c("Name","Mortality")

To set the names in the rows use row.names()

Missing values

Missing values are denoted by NA or NaN for undefined mathematical operations.

  • is.na() is used to test objects if they are NA
  • is.nan() is used to test for NaN
  • NA values have a class also, so there are integer NA, character NA, etc.
  • NaN value is also NA but the converse is not true

Subsetting

Intro

  • The [ operator always returns an object of the same class as the original. It can be used to select multiple elements of an object

  • The [[ operator is used to extract elements of a list or a data frame. It can only be used to extract a single element and the class of the returned object will not necessarily be a list or data frame.

  • The $ operator is used to extract elements of a list or data frame by literal name. Its semantics are similar to that of [[.

  • Functions

    • c()
    • vector()
    • names()
    • class()
    • dimnames()
    • list()
    • factors()
    • levels()
    • is.na()/ is.nan()
    • identical()

%in%

in

v1 <- 3
v2 <- 101
t <- c(1,2,3,4,5,6,7,8)
v1 %in% t
> [1] TRUE
t %in% v1 #Its better big first.
> [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

Vector

x <- c("a", "b", "c", "c", "d", "a")
x[1] ## Extract the first element
> [1] "a"
x[c(1, 3, 4)] # The sequences doesnt has to be in order
> [1] "a" "c" "c"
x[-c(1, 3, 4)] # All elements but those
> [1] "b" "d" "a"

We can also pass a logical sequence to the [ operator to extract elements of a vector that satisfy a given condition. For example, here we want the elements of x that come lexicographically after the letter “a”.

x[x > "a"]
> [1] "b" "c" "c" "d"
  • Compare if two vectors are the same
identical(x,c(1,2))
> [1] FALSE

Matrix

x <- matrix(1:6, 2, 3)
x[1, 2]
> [1] 3
x[1, , drop = FALSE] #if the answer has to be a matrix
>      [,1] [,2] [,3]
> [1,]    1    3    5

Lists

  • Normal lists
x <- list(foo=1:4, bar=0.6) #lista
x[1] #Lista
> $foo
> [1] 1 2 3 4
x[[1]]
> [1] 1 2 3 4
class(x[2])
> [1] "list"
class(x[[2]])
> [1] "numeric"
x$foo
> [1] 1 2 3 4
x[[1]][[2]] #extraer un elemento de la lista
> [1] 2
  • Nested Lists
x <- list(a = list(10, 12, 14), b = c(3.14, 2.81)); x
> $a
> $a[[1]]
> [1] 10
> 
> $a[[2]]
> [1] 12
> 
> $a[[3]]
> [1] 14
> 
> 
> $b
> [1] 3.14 2.81
## Get the 3rd element of the 1st element
x[[c(1, 3)]] #or x[[1]][[3]]
> [1] 14

Estructuras de control

Intro

  • if and else: testing a condition and acting on it
  • for: execute a loop a fixed number of times
  • while: execute a loop while a condition is true
  • repeat: execute an infinite loop (must break out of it to stop)
  • break: break the execution of a loop
  • next: skip an interation of a loop

If Else

condition<-TRUE
if(condition==TRUE) {
  ## do something
  print(TRUE)
} else #else if 
  { print(FALSE)
  ## do something else
}
> [1] TRUE

for

Se puede usar next para cambiar de iteracion o para salir.

for(i in 1:10){
  print(i)
}
> [1] 1
> [1] 2
> [1] 3
> [1] 4
> [1] 5
> [1] 6
> [1] 7
> [1] 8
> [1] 9
> [1] 10
## otra forma de escribirlo
for(i in 1:4) print(i)
> [1] 1
> [1] 2
> [1] 3
> [1] 4
## Para crear una sequencia
seq(5)
> [1] 1 2 3 4 5

While

Se puede reemplazar con un repeat pero hay que usar un break

count<-0
while(count<10) {
  print(count)
  count<-count+1
}
> [1] 0
> [1] 1
> [1] 2
> [1] 3
> [1] 4
> [1] 5
> [1] 6
> [1] 7
> [1] 8
> [1] 9

Funciones

Intro

To see the source code , just type the function without parenthesis. To see the arguments type args(function)

  • seq()
  • seq_along()
  • Funciones logicas
    • isTRUE()
    • xor()
    • all()
    • any()
  • which()

Forma general

above_10 <- function (x, n=10){
  use <- x>n
  x[use]
  # Se puede especificar default values.
}
above_10(c(1,4,6,8,99))
> [1] 99

Funciones Anonimas

evaluate <- function(func, dat){func(dat)}
evaluate(function(x){x[1]}, c(8, 4, 0)) #return the first element of the vector
> [1] 8
evaluate(function(x){x[length(x)]}, c(8, 4, 0))  #return the last element of the vector
> [1] 0

Sequences

How to create a sequence

seq(1,10, by=2)
> [1] 1 3 5 7 9
seq(1,10,length=3)
> [1]  1.0  5.5 10.0
seq(along=c(1,3,8,25,100)) #create a vector with the length of the vector
> [1] 1 2 3 4 5
seq_along(c(1,3,8,25,100)) #Same
> [1] 1 2 3 4 5

Funciones logicas

The & is first evaluated than |. && or || evaluates only the first argument. * isTRUE()

isTRUE(3>2)
> [1] TRUE
  • The xor() function stands for exclusive OR. If one argument evaluates to TRUE and one argument evaluates to FALSE, then this function will return TRUE, otherwise it will return FALSE.
xor(5==6, !FALSE)
> [1] TRUE
xor(TRUE, TRUE)
> [1] FALSE
  • The any() function will return TRUE if one or more of the elements in the logical vector is TRUE. The all() function will return TRUE if every element in the logical vector is TRUE.
ints <- sample(10)
any(ints<0)
> [1] FALSE
all(ints>0)
> [1] TRUE

Index

  • Para encontrar los indices se usa which()
ints <-sample(10)
which(ints>5)
> [1] 1 4 5 6 9

Looping functions

Intro

  • lapply()
  • sapply()
  • apply()
  • mapply() vectorize()
  • tapply()
  • split()

lapply

The lapply() function does the following simple series of operations:
1. it loops over a list, iterating over each element in that list
2. it applies a function to each element of the list (a function that you specify)
3. and returns a list (the l is for “list”).

x<-list(a=matrix(1:4, 2,2), b=matrix(1:6,3,2)); x
> $a
>      [,1] [,2]
> [1,]    1    3
> [2,]    2    4
> 
> $b
>      [,1] [,2]
> [1,]    1    4
> [2,]    2    5
> [3,]    3    6
lapply(x, mean)
> $a
> [1] 2.5
> 
> $b
> [1] 3.5

When you pass a function to lapply(), lapply() takes elements of the list and passes them as the first argument of the function you are applying.

Si se tiene una lista, se va a aplicar la funcion a toda la lista.

library(datasets)
data(iris)
iris[1:5,]
>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> 1          5.1         3.5          1.4         0.2  setosa
> 2          4.9         3.0          1.4         0.2  setosa
> 3          4.7         3.2          1.3         0.2  setosa
> 4          4.6         3.1          1.5         0.2  setosa
> 5          5.0         3.6          1.4         0.2  setosa
lapply(iris,mean)
> Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
> returning NA
> $Sepal.Length
> [1] 5.843333
> 
> $Sepal.Width
> [1] 3.057333
> 
> $Petal.Length
> [1] 3.758
> 
> $Petal.Width
> [1] 1.199333
> 
> $Species
> [1] NA
lapply(x, function(elt) { elt[,1] }) #Elt is the argument, i.e the list
> $a
> [1] 1 2
> 
> $b
> [1] 1 2 3

You can put an arbitrarily complicated function definition inside lapply(), but if it’s going to be more complicated, it’s probably a better idea to define the function separately.

sapply

The sapply() function behaves similarly to lapply(); the only real difference is in the return value. sapply() will try to simplify the result of lapply() if possible. Essentially, sapply() calls lapply() on its input and then applies the following algorithm:

  • If the result is a list where every element is length 1, then a vector is returned
  • If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
  • If it can’t figure things out, a list is returned
x<-list(a=matrix(1:4, 2,2), b=matrix(1:6,3,2)); x
> $a
>      [,1] [,2]
> [1,]    1    3
> [2,]    2    4
> 
> $b
>      [,1] [,2]
> [1,]    1    4
> [2,]    2    5
> [3,]    3    6
sapply(x, mean)
>   a   b 
> 2.5 3.5
library(datasets)
s <- split(airquality, airquality$Month)
sapply(s, function(x) {colMeans(x[, c("Ozone", "Solar.R", "Wind")],na.rm = TRUE)})
>                 5         6          7          8         9
> Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
> Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
> Wind     11.62258  10.26667   8.941935   8.793548  10.18000

apply

The apply() function is used to a evaluate a function (often an anonymous one) over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix (which is just a 2-dimensional array).

str(apply)
> function (X, MARGIN, FUN, ...)

Por ejemplo:

x<-matrix(rnorm(200),20,10)
apply(x,2,mean) #dimension 2 is cols. El promedio se hace vertical
>  [1]  0.17307975  0.07813620 -0.21273428  0.35048079  0.01849736  0.17894928
>  [7] -0.26329213  0.19919152 -0.16809101  0.03871802
apply(x,1,sum)  #Sum through the rows 
>  [1] -4.36190992  5.88912273 -3.74986586 -0.46952941 -2.26029643 -1.72725736
>  [7]  4.79822344 -4.47910605 -0.06277095 -0.80749994  0.88526970  2.36069841
> [13] -2.47805374  2.36070808 -1.20829089  0.74987462  4.37266752  2.53142014
> [19]  4.36210080  1.15320514

Hay funciones como rowsum, o colsum que pueden reducir el codigo.

mapply

The mapply() function is a multivariate apply of sorts which applies a function in parallel over a set of arguments. Recall that lapply() and friends only iterate over a single R object.

str(mapply)
> function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
  • FUN is a function to apply
  • … contains arguments to apply over
  • MoreArgs is a list of other arguments to FUN
  • SIMPLIFY indicates whether the result should be simplyfied

Por ejemplo, en vez de usar list(rep(1,4), rep(2,3), rep(3,2), rep(4,1)) se puede usar:

mapply(rep,1:4,4:1)
> [[1]]
> [1] 1 1 1 1
> 
> [[2]]
> [1] 2 2 2
> 
> [[3]]
> [1] 3 3
> 
> [[4]]
> [1] 4
  • mapply() can be used to vectorize a function:
sumsq <- function(mu, sigma, x) {sum(((x - mu) / sigma)^2)}
x <- rnorm(100) ## Generate some data
sumsq(1:10, 1:10, x) #not the result that we want
> [1] 104.2627
mapply(sumsq, 1:10, 1:10, MoreArgs = list(x = x)) #We can use also vectorize
>  [1] 218.2807 136.8669 119.6283 112.8651 109.4011 107.3393 105.9878 105.0405
>  [9] 104.3429 103.8096

tapply

tapply() is used to apply a function over subsets of a vector. It can be thought of as a combination of split() and sapply() for vectors only. I’ve been told that the “t” in tapply() refers to “table”, but that is unconfirmed.

str(tapply)
> function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
  • X is a vector
  • INDEX is a factor or a list of factors
  • FUN is a function to apply
  • … contains arguments to apply over
  • SIMPLIFY indicates whether the result should be simplyfied
x<-c(rnorm(10), runif(10),rnorm(10))
f<-gl(3,10) #Genera los niveles de los factores gl(n,k) n es el numero de niveles y k el numero de replicas
tapply(x,f,mean)
>          1          2          3 
>  0.2826848  0.3490682 -0.1937033
data("mtcars")
mtcars[1:5,]
>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
tapply(mtcars$mpg,mtcars$cyl,mean)
>        4        6        8 
> 26.66364 19.74286 15.10000

split

split takes a vector or other objetc and splits it into groups determined by a factor of list of factors

str(split)
function(x, f, drop=FALSE,... )
  • x is a vector or data frame
  • f is a factores or a list of factors
  • drop indicates whether empty factors levels should be dropped
x<-c(rnorm(10), runif(10),rnorm(10))
f<-gl(3,10) #Genera los niveles de los factores gl(n,k) n es el numero de niveles y k el numero de replicas
split(x,f) #siempre devuelve una lista. Se puede usar con otra function
> $`1`
>  [1]  0.008569759  0.477261567  1.931180644 -0.936025291 -2.179742864
>  [6]  0.114505524 -1.050862210 -0.096015922 -1.214242667 -0.199772953
> 
> $`2`
>  [1] 0.9795859 0.4812531 0.3831788 0.6900134 0.7154081 0.9284080 0.1054450
>  [8] 0.9040322 0.3522041 0.9960455
> 
> $`3`
>  [1] -1.19798852 -0.86591087 -2.21173945  1.15170942  1.27622749  0.32057739
>  [7]  0.97449109  0.49393589  0.06746323  1.23617451
lapply(split(x,f),mean)
> $`1`
> [1] -0.3145144
> 
> $`2`
> [1] 0.6535574
> 
> $`3`
> [1] 0.124494
  • Usefull example
library(datasets)
s <- split(airquality, airquality$Month)
sapply(s, function(x) {colMeans(x[, c("Ozone", "Solar.R", "Wind")],na.rm = TRUE)})
>                 5         6          7          8         9
> Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
> Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
> Wind     11.62258  10.26667   8.941935   8.793548  10.18000

Regular Expressions - Texto

Intro

  • Texto
    • paste()
    • strcmp()
    • tolower()-toUpper()
    • strsplit()
  • grep() grepl()
  • sapply()
  • apply()
  • mapply() vectorize()
  • tapply()
  • split()

Texto - General

To read text files use readLines(...txt) * Para convertir vectores a caracteres unidos

paste(as.character(c(rep(0,times=3-length(1)),1)), collapse ="")
> [1] "001"
paste0(as.character(c(rep(0,times=3-length(1)),1)), collapse = "")
> [1] "001"
  • Para comparar dos strings
library(pracma)
strcmpi("Hola","hola") #case insensitive
> [1] TRUE
strcmp("Hola","hola")  #case sensitive
> [1] FALSE
  • Para poner todas las letras en minuscula
tolower("HOLA")
> [1] "hola"
  • Para separar de acuerdo a un indicador
testName <- "HOLA.Como.Estas"
strsplit(testName,"\\.")
> [[1]]
> [1] "HOLA"  "Como"  "Estas"
  • Para sustituir un caracter
gsub("[[:punct:]]"," ",testName) #removes all puntuaction
> [1] "HOLA Como Estas"
gsub("[.]"," ",testName) #tambien se puede [-]
> [1] "HOLA Como Estas"

Metacharacters:

Metacharacters * ^ para buscar al inicio * $ final * [Bb] permitir las mayusculas y minusculas. * ? Para hasta el primer match. Ver regexpr()

library(datasets)
grep("^New", state.name) #Index
> [1] 29 30 31 32
grep("^New", state.name, value = TRUE) #Values
> [1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"

grep() - grepl()

grep() is used to match a label into a character vector. Returns a vector of the indices of the elements of x that yielded a match. By setting value = TRUE the output is the real value.

> function (pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, 
>     fixed = FALSE, useBytes = FALSE, invert = FALSE)
g<- grep("Cause: shooting", homicides);length(g)
> [1] 228
g<- grep("Cause: [Ss]hooting", homicides);length(g) #capital insensitive
> [1] 1003

The function grepl() works much like grep() except that it differs in its return value. grepl() returns a logical vector indicating which element of a character vector contains the match.

> function (pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, 
>     useBytes = FALSE)
g<-grepl("^New", state.name);g
>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [25] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
> [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [49] FALSE FALSE
state.name[g]
> [1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"

regexpr() - regexec()

> function (pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, 
>     useBytes = FALSE)

The regexpr() function gives you the (a) index into each string where the match begins and the (b) length of the match for that string. regexpr() only gives you the first match of the string (reading left to right). gregexpr() will give you all of the matches in a given string if there are is more than one match.

regexpr("<dd>[F|f]ound(.*)</dd>", homicides[1]); r<-regexpr("<dd>[F|f]ound(.*)</dd>", homicides[1])
> [1] 177
> attr(,"match.length")
> [1] 93
> attr(,"index.type")
> [1] "chars"
> attr(,"useBytes")
> [1] TRUE
regmatches(homicides[1],r)
> [1] "<dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd>"
regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1]); r<-regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1])
> [1] 177
> attr(,"match.length")
> [1] 33
> attr(,"index.type")
> [1] "chars"
> attr(,"useBytes")
> [1] TRUE
regmatches(homicides[1],r)
> [1] "<dd>Found on January 1, 2007</dd>"

The regexec() function works like regexpr() except it gives you the indices for parenthesized subexpressions.

r<-regexec("<dd>[F|f]ound on (.*?)</dd>", homicides[1])
regmatches(homicides[1], r)
> [[1]]
> [1] "<dd>Found on January 1, 2007</dd>" "January 1, 2007"

gsub() - sub()

sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character).

x <- substr(homicides[1], 177, 177 + 33 - 1); x
> [1] "<dd>Found on January 1, 2007</dd>"
sub("<dd>[F|f]ound on |</dd>", "", x)
> [1] "January 1, 2007</dd>"
gsub("<dd>[F|f]ound on |</dd>", "", x)
> [1] "January 1, 2007"

Useful functions

nchar()

nchar Takes a character vector as an argument and returns a vector whose elements contain the sizes of the corresponding elements of x. Its really fast. Returns a data.frame in a vector is introduced.

x <- c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")
nchar(x)
> [1]  5  6  6  1 15

Data Tables - Data Frames

Intro

  • subset()

subsetting

Creates a subset of a data table

library(datasets)
head(subset(iris, Species=="setosa"))
>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> 1          5.1         3.5          1.4         0.2  setosa
> 2          4.9         3.0          1.4         0.2  setosa
> 3          4.7         3.2          1.3         0.2  setosa
> 4          4.6         3.1          1.5         0.2  setosa
> 5          5.0         3.6          1.4         0.2  setosa
> 6          5.4         3.9          1.7         0.4  setosa
class(subset(iris, Species=="setosa"))
> [1] "data.frame"

Folders and Files

Intro

  • download.file()
  • read.table()

Files

Connection with outside world

  • file, opens a connection to a file
  • url, opens a connection to url
  • gzfile and bzfile opens a connection with compresed files
  • file.exist Check if files exist. Returns TRUE or FALSE
  • dir.create Create directory if it doesnt exist.
  • download.file
> function (url, destfile, method, quiet = FALSE, mode = "w", cacheOK = TRUE, 
>     extra = getOption("download.file.extra"), headers = NULL, ...)

Ejemplo

url <-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
destfile <-"Ejercicio_Semana3/US_communities.csv"
download.file(url, destfile)

Reading files

Hay diferentes funciones para leer los datos:

  • read.table, read.csv Son iguales, la unica difrencia es que read.table el separador por defecto es un espacio y en read.csv es una coma
  • readLines, para leer lineas de texto
  • source, Para leer R code files
  • dget, para leer R code files
  • load, para cargar los datos
  • unserialize, para leer en forma binaria
  • read.table() not so useful for big data.
    • If you are trying to read a .csv file you should use sep =",", header=TRUE.
    • na.strings set the character that represents a missing value
    • skip number of lines to skip before starting to read
    • Cuando se van a leer bases de datos muy grandes, se pueden hacer unas optimizaciones:
    • Poner comment.char="" si no hay comentarios en el archivo
    • Especificar colClasses. Una forma de encontrar las clases de cada columna es:
 initial<- read.table("Nombredelarchivo.txt", nrows=100)  
 classes <- sapply(initial,class)  
 tabAll< read.table("Nombredelarchivo.txt", colClasses=classes) 
+ Especificar nrows, se puede sobre estimar  
+ Puede hacer el calculo de la memoria que va a gastar.
  • reading excel files
    • library(xlsx) es un paquete para leer excel. Tambien XLConnect tiene mas opciones
    • Se puede usar la funcion read.xlsx. Esta funcion permite leer columnas y filas especificas.
    • para crear un archivo write.xlsx
  • data.table es un paquete
    • library(data.table)
    • tables() todas las tablas del documento
    • Importante Para crear una copia de una tabla se debe hacer uso de la funcion copy

dplyr Package

Intro

Its useful for working with data frames.

  • select()
  • filter()
  • arrange()
  • rename()
  • mutate()
  • group_by()
  • merge()

Select

> function (.data, ...)
  • Para ver los datos de columnas
head(select(iris, Sepal.Length:Petal.Length)) 
>   Sepal.Length Sepal.Width Petal.Length
> 1          5.1         3.5          1.4
> 2          4.9         3.0          1.4
> 3          4.7         3.2          1.3
> 4          4.6         3.1          1.5
> 5          5.0         3.6          1.4
> 6          5.4         3.9          1.7
  • Para omitir los datos de una columna
head(select(iris, -Sepal.Length))
>   Sepal.Width Petal.Length Petal.Width Species
> 1         3.5          1.4         0.2  setosa
> 2         3.0          1.4         0.2  setosa
> 3         3.2          1.3         0.2  setosa
> 4         3.1          1.5         0.2  setosa
> 5         3.6          1.4         0.2  setosa
> 6         3.9          1.7         0.4  setosa

Filter

Subset rows given a condition.

> function (.data, ..., .preserve = FALSE)
head(filter(iris, Petal.Width>0.2))
>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> 1          5.4         3.9          1.7         0.4  setosa
> 2          4.6         3.4          1.4         0.3  setosa
> 3          5.7         4.4          1.5         0.4  setosa
> 4          5.4         3.9          1.3         0.4  setosa
> 5          5.1         3.5          1.4         0.3  setosa
> 6          5.7         3.8          1.7         0.3  setosa

Arrange

Order the rows

> function (.data, ..., .by_group = FALSE)
head(arrange(iris, Petal.Width)) #se puede usar desc(Petal.Width) para el orden descendiente
>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> 1          4.9         3.1          1.5         0.1  setosa
> 2          4.8         3.0          1.4         0.1  setosa
> 3          4.3         3.0          1.1         0.1  setosa
> 4          5.2         4.1          1.5         0.1  setosa
> 5          4.9         3.6          1.4         0.1  setosa
> 6          5.1         3.5          1.4         0.2  setosa
tail(arrange(iris, Petal.Width))
>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
> 145          5.8         2.8          5.1         2.4 virginica
> 146          6.3         3.4          5.6         2.4 virginica
> 147          6.7         3.1          5.6         2.4 virginica
> 148          6.3         3.3          6.0         2.5 virginica
> 149          7.2         3.6          6.1         2.5 virginica
> 150          6.7         3.3          5.7         2.5 virginica

Rename

Rename cols

> function (.data, ...)
head(rename(iris, Petal_Width=Petal.Width)) #se puede usar desc(Petal.Width) para el orden descendiente
>   Sepal.Length Sepal.Width Petal.Length Petal_Width Species
> 1          5.1         3.5          1.4         0.2  setosa
> 2          4.9         3.0          1.4         0.2  setosa
> 3          4.7         3.2          1.3         0.2  setosa
> 4          4.6         3.1          1.5         0.2  setosa
> 5          5.0         3.6          1.4         0.2  setosa
> 6          5.4         3.9          1.7         0.4  setosa

Mutate

Create or adds new variables

> function (.data, ...)
head(mutate(iris, Petal_std=Petal.Width-mean(Petal.Width))) #se puede usar desc(Petal.Width) para el orden descendiente
>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  Petal_std
> 1          5.1         3.5          1.4         0.2  setosa -0.9993333
> 2          4.9         3.0          1.4         0.2  setosa -0.9993333
> 3          4.7         3.2          1.3         0.2  setosa -0.9993333
> 4          4.6         3.1          1.5         0.2  setosa -0.9993333
> 5          5.0         3.6          1.4         0.2  setosa -0.9993333
> 6          5.4         3.9          1.7         0.4  setosa -0.7993333

Group_by - ddply

Group by categories. Es una especie de tabla dinamica.

> function (.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))
group<-group_by(iris, Species) #se puede usar desc(Petal.Width) para el orden descendiente
summarise(group, Petal_meanWidth = mean(Petal.Width), Max_Sepal_length = max(Sepal.Length))
> `summarise()` ungrouping output (override with `.groups` argument)
> # A tibble: 3 x 3
>   Species    Petal_meanWidth Max_Sepal_length
>   <fct>                <dbl>            <dbl>
> 1 setosa               0.246              5.8
> 2 versicolor           1.33               7  
> 3 virginica            2.03               7.9

ddply() Split Data Frame, Apply Function, And Return Results In A Data Frame.

library(plyr)
> Warning: package 'plyr' was built under R version 4.0.2
> ------------------------------------------------------------------------------
> You have loaded plyr after dplyr - this is likely to cause problems.
> If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
> library(plyr); library(dplyr)
> ------------------------------------------------------------------------------
> 
> Attaching package: 'plyr'
> The following objects are masked from 'package:dplyr':
> 
>     arrange, count, desc, failwith, id, mutate, rename, summarise,
>     summarize
ans <- ddply(iris, .(Species), function(x) {h<-max(x$Sepal.Length);
p<-mean(x$Petal.Width);ans<-cbind(h,p)})
 names(ans)<-c("Species","Petal_meanWidth","Max_Sepal_length"); ans
>      Species Petal_meanWidth Max_Sepal_length
> 1     setosa             5.8            0.246
> 2 versicolor             7.0            1.326
> 3  virginica             7.9            2.026

Merge

Merge with common id. merge, join, join_all

> function (.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))
group<-group_by(iris, Species) #se puede usar desc(Petal.Width) para el orden descendiente
summarise(group, Petal_meanWidth=mean(Petal.Width), Max_Sepal_length=max(Sepal.Length))
>   Petal_meanWidth Max_Sepal_length
> 1        1.199333              7.9

Plotting ang Graphs

Graphs

Intro

Important Save the file as svg. (scalable vector). So it wont lose resolution.

  • par()
  • plot() - hist()
  • boxplot()

Plots

In the base system, first a graph has to be initialized and then it has to be annotated.

  1. To initialize use a function like plot or hist
  2. To annotated par can be used to set or query graphical parameters. par() creates global parameters of the all graphics. Can be overridden. Anotation like adding lines, points, legends, etc.
  • pch the plotting symbol if its ’*" or ‘.’. 1 is circle
  • lty line type
  • lwd line width
  • col plotting color, number, string or hexcode
  • xlab and ylab labels
  • mar: A numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin to be specified on the four sides of the plot. The default is c(5, 4, 4, 2) + 0.1
  • bg background color
  • las orientation of the axis
  • mfrow ormcol to create a subplot: A vector of the form c(nr, nc). Subsequent figures will be drawn in an nr-by-nc array on the device by columns (mfcol), or rows (mfrow), respectively.
  • oma A vector of the form c(bottom, left, top, right) giving the size of the outer margins in lines of text. default par("oma") is 0 0 0 0
  • type =“n” makes the data invisible
Base Plotting Functions
  • lines: add lines to a plot, given a vector of x values and a corresponding vector of y values (or a 2-column matrix); this function just connects the dots

  • points: add points to a plot

  • text: add text labels to a plot using specified x, y coordinates

  • title: add annotations to x, y axis labels, title, subtitle, outer margin

  • mtext: add arbitrary text to the margins (inner or outer) of the plot

  • axis: adding axis ticks/labels

  • Can create scatterplots separated by category

with(data = iris, plot(Petal.Width, Petal.Length, col=Species))
title( main = " Petal Length vs Petal Width ")

with(data = airquality, plot(Wind, Ozone, main = " Ozone and Wind in NY", col="red")) #Title can be added to plot
with(subset(airquality, Month ==5), points(Wind, Ozone, col="blue")) # Added on top
legend("topright", pch=1, col=c("blue", "red"), legend=c("May", "Other Months"))

# Legend=topleft, toprigth, bottomleft, bottomrigth
Subplot
  • Can create subplots par() can be used to set or query graphical parameters
par(mfcol=c(1,2),mar=c(4,4,2,1))
hist(subset(iris, Species == "setosa")$Sepal.Length, breaks = 10, main="Setosa sepal length", xlab="Length", col="#8BC5CC")
hist(subset(iris, Species == "virginica")$Sepal.Length, breaks = 10, main="Virginica sepal length", xlab="Length", col="#8BCCA2")

par(mfrow = c(2, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(airquality, 
{
 plot(Wind, Ozone, main = "Ozone and Wind")
 plot(Solar.R, Ozone, main = "Ozone and Solar Radiation", pch=2, type="p", col="blue", col.lab="red")
 plot(Temp, Ozone, main = "Ozone and Temperature", pch=5, type="l", col.main='#296A40')
 plot(Wind, Ozone, main = "Ozone and Wind", pch=10, col.axis='purple')
 plot(Solar.R, Ozone, main = "Ozone and Solar Radiation", pch=".", type="h", fg="green") 
 plot(Temp, Ozone, main = "Ozone and Temperature", pch="a", type="b")
 mtext("Ozone and Weather in New York City", outer = TRUE) #outer text
 })

Histogram

hist Computes a histogram of the given data values

> function (x, ...)
  • breaks=num sets the number of cells in the histogram
hist(iris$Sepal.Length, breaks = 10)

Can add rug() to see all the points.

hist(iris$Sepal.Length, breaks = 10)
rug(iris$Sepal.Length)

SmoothScatter() produces a smoothed color density representation of a scatterplot, obtained through a (2D) kernel density estimate. Like a 2D histogram.

x <- rnorm(10000)
y <- rnorm(10000)
par(mfrow=c(1,3))
plot(x,y, pch =19); smoothScatter(x,y); plot(x,y, col = rgb(0,0,0,0.05), pch =19)

Box plot

boxplot Produce box-and-whisker plot(s) of the given (grouped) values.

airquality <- transform( airquality, Month =factor(Month))
boxplot(Ozone ~ Month, airquality, xlab = "Month", ylab="Ozone ppb")

Lattice system

lattice Everything in one line code. How to interpret next code line: life.exp ~ Income is the same as y vs x, the | represents the condition. i.e I want to see Life.Exp vs Income per month. xyplot

library(lattice)
state<-data.frame(state.x77, region = state.region)
xyplot(Life.Exp ~ Income| region, data= state, layout=c(4,1))

airquality <- transform(airquality, Month = factor(Month))
xyplot(Ozone ~ Wind | Month, data = airquality, layout=c(5,1))

It’s worth to mention that without the transformation there would be no labels corresponding to the months

ggplot2

ggplot2 Mixes elements of Base and Lattice. Uses qplot().
Color in ggplot

str(qplot)
> function (x, y, ..., data, facets = NULL, margins = FALSE, geom = "auto", 
>     xlim = c(NA, NA), ylim = c(NA, NA), log = "", main = NULL, xlab = NULL, 
>     ylab = NULL, asp = NA, stat = NULL, position = NULL)
str(mpg)
> tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
>  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
>  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
>  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
>  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
>  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
>  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
>  $ drv         : chr [1:234] "f" "f" "f" "f" ...
>  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
>  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
>  $ fl          : chr [1:234] "p" "p" "p" "p" ...
>  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
p1 <- qplot(displ, hwy, data=mpg)
p2 <- qplot(displ, hwy, data=mpg, color=year, geom = c("point","smooth"))
grid.arrange(p1, p2, nrow = 1)
> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  • qplot using only one argument is an histogram
p1 <- qplot(hwy, data=mpg, fill=drv, main ="Separated by drv")
p2 <- qplot(hwy, data=mpg, fill=manufacturer, main ="Separated by manufacturer")
grid.arrange(p1, p2, nrow = 2)
> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • Facets
    The variable on the right of facets determines the columns of the panels. i.e each panel corresponds to a variable in drv.
p1 <- qplot(displ, hwy, data=mpg, facets = . ~ drv, main ="Without row definition") 
p2 <- qplot(hwy, data=mpg, facets = drv ~ ., main ="Without col definition", binwidth =2)
grid.arrange(p1, p2, nrow = 1)

  • Building in layers
air <- datasets::airquality
data <-na.omit(air);
data$Temp <- as.numeric(data$Temp); data$Day <- as.numeric(data$Day);
str(data); 
> 'data.frame': 111 obs. of  6 variables:
>  $ Ozone  : int  41 36 12 18 23 19 8 16 11 14 ...
>  $ Solar.R: int  190 118 149 313 299 99 19 256 290 274 ...
>  $ Wind   : num  7.4 8 12.6 11.5 8.6 13.8 20.1 9.7 9.2 10.9 ...
>  $ Temp   : num  67 72 74 62 65 59 61 69 66 68 ...
>  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
>  $ Day    : num  1 2 3 4 7 8 9 12 13 14 ...
>  - attr(*, "na.action")= 'omit' Named int [1:42] 5 6 10 11 25 26 27 32 33 34 ...
>   ..- attr(*, "names")= chr [1:42] "5" "6" "10" "11" ...
g <- ggplot(data, aes(Day, Temp)) #aes defines the mapping. i.e the x and y
g + geom_point(aes(color = Month)) + facet_grid( . ~ Month)+ scale_color_distiller(palette="Blues")

g + geom_point(aes(color = Month), size = 2) + scale_color_distiller(palette="Spectral")

ggplot(data, aes(Day, Temp, group=Month, color=Month)) + geom_line() +geom_point() + scale_color_distiller(palette="Oranges") + theme(plot.title = element_text(hjust = 0.5)) +labs( x="Days", y="Temperature", title = "Temp vs Days") 

NOTE To add limits do not add ylim instead coord_cartesian.

Color

To see all the colors that can be called by name use colors().

Using colorRamp() a palette can be created between the given colors.

ramp <- colorRamp(c("red", "blue")) #
ramp(1) #Blue
>      [,1] [,2] [,3]
> [1,]    0    0  255
ramp(0) #Red
>      [,1] [,2] [,3]
> [1,]  255    0    0
ramp(seq(0,1, len = 10)) 
>            [,1] [,2]      [,3]
>  [1,] 255.00000    0   0.00000
>  [2,] 226.66667    0  28.33333
>  [3,] 198.33333    0  56.66667
>  [4,] 170.00000    0  85.00000
>  [5,] 141.66667    0 113.33333
>  [6,] 113.33333    0 141.66667
>  [7,]  85.00000    0 170.00000
>  [8,]  56.66667    0 198.33333
>  [9,]  28.33333    0 226.66667
> [10,]   0.00000    0 255.00000

Using colorRampPalette() a hexadecimal system for colors can be created

pal <- colorRampPalette(c("red", "blue")) #
pal(1) #Blue
> [1] "#FF0000"
pal(10) #Red
>  [1] "#FF0000" "#E2001C" "#C60038" "#AA0055" "#8D0071" "#71008D" "#5500AA"
>  [8] "#3800C6" "#1C00E2" "#0000FF"
par(mfrow = c(2, 1))
hist(1:10, col=pal(10)); hist(1:10, col=pal(5))

Clustering

Intro

  • hclust()
  • kmeans()

Hierarchical clustering

hclust allows to create a hierarchical clustering. The idea is that the points with smallest distant among each other are merged together.

set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each =4), sd=0.2)
y <- rnorm(12, mean = rep(c(1,2,1), each =4), sd=0.2)
plot(x,y, col ='blue', pch =19); text(x+0.05,y+0.05, labels = as.character(1:12))

dataFrame <- data.frame(x=x,y=y); dataFrame
>            x         y
> 1  0.7585869 0.8447492
> 2  1.0554858 1.0128918
> 3  1.2168882 1.1918988
> 4  0.5308605 0.9779429
> 5  2.0858249 1.8977981
> 6  2.1012112 1.8177609
> 7  1.8850520 1.8325657
> 8  1.8906736 2.4831670
> 9  2.8871096 1.0268176
> 10 2.8219924 0.9018628
> 11 2.9045615 0.9118904
> 12 2.8003227 1.0919179
distxy <-dist(dataFrame); distxy # Pairwise distance, default Euclidean distance
>             1          2          3          4          5          6          7
> 2  0.34120511                                                                  
> 3  0.57493739 0.24102750                                                       
> 4  0.26381786 0.52578819 0.71861759                                            
> 5  1.69424700 1.35818182 1.11952883 1.80666768                                 
> 6  1.65812902 1.31960442 1.08338841 1.78081321 0.08150268                      
> 7  1.49823399 1.16620981 0.92568723 1.60131659 0.21110433 0.21666557           
> 8  1.99149025 1.69093111 1.45648906 2.02849490 0.61704200 0.69791931 0.65062566
> 9  2.13629539 1.83167669 1.67835968 2.35675598 1.18349654 1.11500116 1.28582631
> 10 2.06419586 1.76999236 1.63109790 2.29239480 1.23847877 1.16550201 1.32063059
> 11 2.14702468 1.85183204 1.71074417 2.37461984 1.28153948 1.21077373 1.37369662
> 12 2.05664233 1.74662555 1.58658782 2.27232243 1.07700974 1.00777231 1.17740375
>             8          9         10         11
> 2                                             
> 3                                             
> 4                                             
> 5                                             
> 6                                             
> 7                                             
> 8                                             
> 9  1.76460709                                 
> 10 1.83517785 0.14090406                      
> 11 1.86999431 0.11624471 0.08317570           
> 12 1.66223814 0.10848966 0.19128645 0.20802789
hClustering <- hclust(distxy); plot(hClustering)

To create pretty dendograms:

myplclust(hClustering, lab = hClustering$order, lab.col = rep(1:3, each =4))

K-means

kmeans Perform k-means clustering on a data matrix. kmeans other source

kmeansobj <- kmeans(dataFrame, centers = 3, iter.max = 10)
kmeansobj$cluster #for each data, which cluster is in.
>  [1] 3 1 1 3 2 2 2 2 2 2 2 2
plot(x,y, col= kmeansobj$cluster, pch = 19); points(kmeansobj$centers, col=1:3, pch=3)

Statistics

Simulation

Intro

  • rnorm()
  • dnorm()
  • pnorm()
  • rnorm()
  • sample()

Distributions

There are different functions for probability distribution in R:

  • rnorm generates random variable Normal variables with a given mean and standard deviation
  • dnorm evaluate the normal probability density with a given mean or sd at a point
  • pnorm evaluate the cumulative distribution function for a norma distribution
  • rpois generate random poisson variates with a given rate

Hay diferentes functiones o letras asociadas: p: distribucion cumulativa, q: funcion cuantil, r: generador de numeros aleatorios, d: densidad. En orden: rnorm, dnorm

> function (n, mean = 0, sd = 1)
> function (x, mean = 0, sd = 1, log = FALSE)
> function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
> function (n, lambda)

Para reajustar o reestablecer el generador de numeros se utiliza set.seed(1).

Random Sampling

The function sample draws ramdomly from a specified set of sclara objects allowing to sample from arbitry distributions.

str(sample) # con repeat se puede obtener numeros repetidos
> function (x, size, replace = FALSE, prob = NULL)
set.seed(1)
sample(1:10,4)
> [1] 9 4 7 1

Funciones creadas

Text-Telegram

  • Returns a text with START at the begining and STOP at the end
telegram <- function(...){paste("START", ..., "STOP")}

Regression Models

Intro

  • Functions
    • lm()
    • melt()
    • lm()
    • lm()

Least Squares

In the 1880’s, Francis Galton was developing ways to quantify the heritability of traits. As part of this work, he collected data on the heights of adult children and their parents. We use melt to reshape the table

library(reshape); 
data(galton); 
str(galton)
> 'data.frame': 928 obs. of  2 variables:
>  $ child : num  61.7 61.7 61.7 61.7 61.7 62.2 62.2 62.2 62.2 62.2 ...
>  $ parent: num  70.5 68.5 65.5 64.5 64 67.5 67.5 67.5 66.5 66.5 ...
head(galton)
>   child parent
> 1  61.7   70.5
> 2  61.7   68.5
> 3  61.7   65.5
> 4  61.7   64.5
> 5  61.7   64.0
> 6  62.2   67.5
long <- melt(galton)
> Using  as id variables
head(long)
>   variable value
> 1    child  61.7
> 2    child  61.7
> 3    child  61.7
> 4    child  61.7
> 5    child  61.7
> 6    child  62.2
g <- ggplot(long, aes(x = value, fill = variable)) 
g <- g + geom_histogram(colour = "black", binwidth=1) 
g <- g + facet_grid(. ~ variable) + labs("Histogram of childs and parents heigths")
g

The command for simple linear regression is lm(response variable~explanatory variable). lm() is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance. Its the same calculating the betas or using lm().

When fitting a linear model y ~ x - 1 specifies a line through the origin. This is useful in some cases

y <- galton$child
x <- galton$parent
beta1 <- cor(y, x) *  sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
rbind(c(beta0, beta1), coef(lm(y ~ x))) #Same using both procedures
>      (Intercept)         x
> [1,]    23.94153 0.6462906
> [2,]    23.94153 0.6462906

Si se normalizan los datos, entonces, la pendiente es la correlacion. La correlacion de x y y es la misma que de los valores normalizados y de la ecuacion de la recta que pasa por los ultimos. Pero el coeficiente no es el mismo que cuando pasa por datos no normalizados.

yn <- (y - mean(y))/sd(y)
xn <- (x - mean(x))/sd(x)
c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2])
>                            xn 
> 0.4587624 0.4587624 0.4587624

Add regression line to plot

Exercise

Consider the data set given below. Give the value of mu that minimazes the least squares equation

x <- c(0.18, -1.54, 0.42, 0.95)
w <- c(2, 1, 3, 1)

mean(w*x)/mean(w)
> [1] 0.1471429

Consider the data set. Fit the regression through the origin and get the slope treating y as the outcome and x as the regressor. (Hint, do not center the data since we want regression through the origin, not through the means of the data.)

x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)

lm(formula = y ~ x - 1)
> 
> Call:
> lm(formula = y ~ x - 1)
> 
> Coefficients:
>      x  
> 0.8263
summary(lm(y~x-1))
> 
> Call:
> lm(formula = y ~ x - 1)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -2.0692 -0.2536  0.5303  0.8592  1.1286 
> 
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> x   0.8263     0.5817   1.421    0.189
> 
> Residual standard error: 1.094 on 9 degrees of freedom
> Multiple R-squared:  0.1831,  Adjusted R-squared:  0.09238 
> F-statistic: 2.018 on 1 and 9 DF,  p-value: 0.1892

Do data(mtcars) from the datasets package and fit the regression model with mpg as the outcome and weight as the predictor. Give the slope coefficient.

data(mtcars)
head(mtcars)
>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
x <- mtcars$wt
y <- mtcars$mpg

lm(y ~ x)
> 
> Call:
> lm(formula = y ~ x)
> 
> Coefficients:
> (Intercept)            x  
>      37.285       -5.344
summary(lm(mpg ~ wt, mtcars)) #anothes form to write this
> 
> Call:
> lm(formula = mpg ~ wt, data = mtcars)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -4.5432 -2.3647 -0.1252  1.4096  6.8727 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)    
> (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
> wt           -5.3445     0.5591  -9.559 1.29e-10 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 3.046 on 30 degrees of freedom
> Multiple R-squared:  0.7528,  Adjusted R-squared:  0.7446 
> F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

What is the value of the first measurement if x were normalized (to have mean 0 and variance 1)?

x <- c(8.58, 10.46, 9.01, 9.64, 8.86)
xn <- (x-mean(x))/sd(x)
xn
> [1] -0.9718658  1.5310215 -0.3993969  0.4393366 -0.5990954

What value minimizes the sum of the squared distances between these points and itself? For least squares the mean minimizes.

x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
mean(x)
> [1] 0.573
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)

lm(y~x)
> 
> Call:
> lm(formula = y ~ x)
> 
> Coefficients:
> (Intercept)            x  
>       1.567       -1.713