Data Science - Coursera

Notas para la futura Alejandra. Recuerda SIEMPRE estos pasos: Guardar->Commit->Push (asi se hayan desaparecido los archivos) y ya. Puedes usar esta pagina donde estan las referencias de R Markdown. Se puede usar Esta pagina para cosas utiles.

Markdown Sheets
Markdown Info
R Cheat Sheets

Para borrar la consola cntrl+L y para borrar todas las variables rm(list=ls()). Para tener informacion de una funcion se pone “?funcion”

There are different types of classes of objetcs. R has five basic or atomic classes of objects:

character
numeric
integer (To specify the use L something like 5L)
complex
Logical

R objects can have attributes which are like metadata. Attributes can be accessed using attributes()

names, dimnames
dimensions
class
length
other

Para hacer Debugging:

traceback: where does the error occured. Devuelve donde ocurrio el error hay que usarla inmediatamente cuando sale un error.
debug y browser: cuando se llega a esta linea se empieza a correr linea por linea. Permite evaluar uno a uno. Para salir del debug se pone Q o c
trace
recover: permite ver el ambiente (environment en que estaba la variable)

Generalities

Vectores, Matrices, listas, factores, dataframe…

Intro

Functions
- c()
- vector()
- names()
- class()
- dimnames()
- list()
- factors()
- levels()
- is.na()/ is.nan()

Vectors

Vectors can have only one class of objects, when different objects are mixed in a vector, coercion occurs so that every element in the vector is of the same class. To create a vector:

Con la funcion c()

x<- c(0.5, 0.6) #Como si c significara concatenar
print(x)

> [1] 0.5 0.6

Tambien se le pueden poner nombres a las columnas

names(x)<-c("Food","Bar")
x

> Food  Bar 
>  0.5  0.6

Pueden haber distintos objetos (numericos, logicos, enteros, etc)

Con la funcion vector()

x<-vector("numeric", length=10)
x

>  [1] 0 0 0 0 0 0 0 0 0 0

Also, different modes can be used such as “logical”, “integer”, “numeric” (synonym “double”), “complex”, “character” and “raw”.
*** Para saber la clase entonces se usa class()

class(x)

> [1] "numeric"

Se puede forzar la conversion entre clases de numeros. Por ejemplo:

x<-0:6
class(x)

> [1] "integer"

as.numeric(x)

> [1] 0 1 2 3 4 5 6

as.character(x)

> [1] "0" "1" "2" "3" "4" "5" "6"

as.logical(x)

> [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Vectors dont have dimentions. However, can be assigned:

my_vector <-1:20  
dim(my_vector)

> NULL

dim(my_vector) <- c(4,5) #rows and cols
my_vector # now is a matrix

>      [,1] [,2] [,3] [,4] [,5]
> [1,]    1    5    9   13   17
> [2,]    2    6   10   14   18
> [3,]    3    7   11   15   19
> [4,]    4    8   12   16   20

Lists

Are a special type of vectors that contain elements of different classes. To create one:

Using list()

x<-list(1, "a", TRUE, 1+4i) #Recuerda que los numeros que cambian son los indices de los vectores, tienen dobles corchetes cuadrados
print(x)

> [[1]]
> [1] 1
> 
> [[2]]
> [1] "a"
> 
> [[3]]
> [1] TRUE
> 
> [[4]]
> [1] 1+4i

Create an empty list with vector()

x <- vector("list", length = 5)

Tambien pueden tener nombres

x<-list(a=1,b= "a",c= TRUE,d= 1+4i)
x

> $a
> [1] 1
> 
> $b
> [1] "a"
> 
> $c
> [1] TRUE
> 
> $d
> [1] 1+4i

Para eliminar una lista se puede usar unlist()

To access a list $ is used

x$b

> [1] "a"

Matrices

Matrices are vectores with a dimmension attribute. Se construyen column wise es decir, se llenan las filas de arriba para abajo una columna a la vez.

m<- matrix(1:6, nrow=2, ncol=3) # Se construye column wise
m

>      [,1] [,2] [,3]
> [1,]    1    3    5
> [2,]    2    4    6

dim(m)

> [1] 2 3

attributes(m) #Todo lo que se le pueda sacar

> $dim
> [1] 2 3

m[1,] #Para extraer solo la primera fila

> [1] 1 3 5

Pueden tener nombres. Also rownames() or colnames() can be used

dimnames(m) <- list(c("fila1","fila2"),c("col1","col2","col3")) #Se usa una listaaa
m

>       col1 col2 col3
> fila1    1    3    5
> fila2    2    4    6

Otra forma es redimensionar un vector

m<-1:10
dim(m) <- c(2,5) #dos filas, dos columnas
m

>      [,1] [,2] [,3] [,4] [,5]
> [1,]    1    3    5    7    9
> [2,]    2    4    6    8   10

Cbind-ing and rbind-ing

x<-1:3
y<-10:12
cbind(x,y) # Los une por columnas

>      x  y
> [1,] 1 10
> [2,] 2 11
> [3,] 3 12

print("Tambien se puede unir por filas")

> [1] "Tambien se puede unir por filas"

rbind(x,y)

>   [,1] [,2] [,3]
> x    1    2    3
> y   10   11   12

Factors

One can think of a factor as an integer vector where each integer has a label. Factors are automatically created when read.table() is used. Para tener datos categoricos:

x<-factor(c("yes","no","yes", "no"), levels = c("yes","no")) #Para determinar el orden de los niveles.
x

> [1] yes no  yes no 
> Levels: yes no

table(x) #Conteo

> x
> yes  no 
>   2   2

class(x)

> [1] "factor"

Data Frame

Are used to stored tabular data in R. dplyr is a package optimized for data frames. Are represented as a special type of list where every element of the list has to have the same length. Can store different classes. Generally are read using functions such as read.table() or read.csv()

x<-data.frame(food=1:4, bar=c(T,T,F,F))
x

>   food   bar
> 1    1  TRUE
> 2    2  TRUE
> 3    3 FALSE
> 4    4 FALSE

nrow(x)

> [1] 4

ncol(x)

> [1] 2

Para cambiar el nombre de las columnas. names() can be used tu columns

colnames(x)[c(1,2)]<-c("Name","Mortality"); x

>   Name Mortality
> 1    1      TRUE
> 2    2      TRUE
> 3    3     FALSE
> 4    4     FALSE

x<-data.frame(food=1:4, bar=c(T,T,F,F))
names(x) <- c("Name","Mortality")

To set the names in the rows use row.names()

Missing values

Missing values are denoted by NA or NaN for undefined mathematical operations.

is.na() is used to test objects if they are NA
is.nan() is used to test for NaN
NA values have a class also, so there are integer NA, character NA, etc.
NaN value is also NA but the converse is not true

Subsetting

Intro

The [ operator always returns an object of the same class as the original. It can be used to select multiple elements of an object
The [[ operator is used to extract elements of a list or a data frame. It can only be used to extract a single element and the class of the returned object will not necessarily be a list or data frame.
The $ operator is used to extract elements of a list or data frame by literal name. Its semantics are similar to that of [[.
Functions
- c()
- vector()
- names()
- class()
- dimnames()
- list()
- factors()
- levels()
- is.na()/ is.nan()
- identical()

`%in%`

v1 <- 3
v2 <- 101
t <- c(1,2,3,4,5,6,7,8)
v1 %in% t

> [1] TRUE

t %in% v1 #Its better big first.

> [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

Vector

x <- c("a", "b", "c", "c", "d", "a")
x[1] ## Extract the first element

> [1] "a"

x[c(1, 3, 4)] # The sequences doesnt has to be in order

> [1] "a" "c" "c"

x[-c(1, 3, 4)] # All elements but those

> [1] "b" "d" "a"

We can also pass a logical sequence to the [ operator to extract elements of a vector that satisfy a given condition. For example, here we want the elements of x that come lexicographically after the letter “a”.

x[x > "a"]

> [1] "b" "c" "c" "d"

Compare if two vectors are the same

identical(x,c(1,2))

> [1] FALSE

Matrix

x <- matrix(1:6, 2, 3)
x[1, 2]

> [1] 3

x[1, , drop = FALSE] #if the answer has to be a matrix

>      [,1] [,2] [,3]
> [1,]    1    3    5

Lists

Normal lists

x <- list(foo=1:4, bar=0.6) #lista
x[1] #Lista

> $foo
> [1] 1 2 3 4

x[[1]]

> [1] 1 2 3 4

class(x[2])

> [1] "list"

class(x[[2]])

> [1] "numeric"

x$foo

> [1] 1 2 3 4

x[[1]][[2]] #extraer un elemento de la lista

> [1] 2

Nested Lists

x <- list(a = list(10, 12, 14), b = c(3.14, 2.81)); x

> $a
> $a[[1]]
> [1] 10
> 
> $a[[2]]
> [1] 12
> 
> $a[[3]]
> [1] 14
> 
> 
> $b
> [1] 3.14 2.81

## Get the 3rd element of the 1st element
x[[c(1, 3)]] #or x[[1]][[3]]

> [1] 14

Estructuras de control

Intro

if and else: testing a condition and acting on it
for: execute a loop a fixed number of times
while: execute a loop while a condition is true
repeat: execute an infinite loop (must break out of it to stop)
break: break the execution of a loop
next: skip an interation of a loop

If Else

condition<-TRUE
if(condition==TRUE) {
  ## do something
  print(TRUE)
} else #else if 
  { print(FALSE)
  ## do something else
}

> [1] TRUE

for

Se puede usar next para cambiar de iteracion o para salir.

for(i in 1:10){
  print(i)
}

> [1] 1
> [1] 2
> [1] 3
> [1] 4
> [1] 5
> [1] 6
> [1] 7
> [1] 8
> [1] 9
> [1] 10

## otra forma de escribirlo
for(i in 1:4) print(i)

> [1] 1
> [1] 2
> [1] 3
> [1] 4

## Para crear una sequencia
seq(5)

> [1] 1 2 3 4 5

While

Se puede reemplazar con un repeat pero hay que usar un break

count<-0
while(count<10) {
  print(count)
  count<-count+1
}

> [1] 0
> [1] 1
> [1] 2
> [1] 3
> [1] 4
> [1] 5
> [1] 6
> [1] 7
> [1] 8
> [1] 9

Funciones

Intro

To see the source code , just type the function without parenthesis. To see the arguments type args(function)

seq()
seq_along()
Funciones logicas
- isTRUE()
- xor()
- all()
- any()
which()

Forma general

above_10 <- function (x, n=10){
  use <- x>n
  x[use]
  # Se puede especificar default values.
}
above_10(c(1,4,6,8,99))

> [1] 99

Funciones Anonimas

evaluate <- function(func, dat){func(dat)}
evaluate(function(x){x[1]}, c(8, 4, 0)) #return the first element of the vector

> [1] 8

evaluate(function(x){x[length(x)]}, c(8, 4, 0))  #return the last element of the vector

> [1] 0

Sequences

How to create a sequence

seq(1,10, by=2)

> [1] 1 3 5 7 9

seq(1,10,length=3)

> [1]  1.0  5.5 10.0

seq(along=c(1,3,8,25,100)) #create a vector with the length of the vector

> [1] 1 2 3 4 5

seq_along(c(1,3,8,25,100)) #Same

> [1] 1 2 3 4 5

Funciones logicas

The & is first evaluated than |. && or || evaluates only the first argument. * isTRUE()

isTRUE(3>2)

> [1] TRUE

The xor() function stands for exclusive OR. If one argument evaluates to TRUE and one argument evaluates to FALSE, then this function will return TRUE, otherwise it will return FALSE.

xor(5==6, !FALSE)

> [1] TRUE

xor(TRUE, TRUE)

> [1] FALSE

The any() function will return TRUE if one or more of the elements in the logical vector is TRUE. The all() function will return TRUE if every element in the logical vector is TRUE.

ints <- sample(10)
any(ints<0)

> [1] FALSE

all(ints>0)

> [1] TRUE

Index

Para encontrar los indices se usa which()

ints <-sample(10)
which(ints>5)

> [1] 1 4 5 6 9

Looping functions

Intro

lapply()
sapply()
apply()
mapply() vectorize()
tapply()
split()

lapply

The lapply() function does the following simple series of operations:
1. it loops over a list, iterating over each element in that list
2. it applies a function to each element of the list (a function that you specify)
3. and returns a list (the l is for “list”).

x<-list(a=matrix(1:4, 2,2), b=matrix(1:6,3,2)); x

> $a
>      [,1] [,2]
> [1,]    1    3
> [2,]    2    4
> 
> $b
>      [,1] [,2]
> [1,]    1    4
> [2,]    2    5
> [3,]    3    6

lapply(x, mean)

> $a
> [1] 2.5
> 
> $b
> [1] 3.5

When you pass a function to lapply(), lapply() takes elements of the list and passes them as the first argument of the function you are applying.

Si se tiene una lista, se va a aplicar la funcion a toda la lista.

library(datasets)
data(iris)
iris[1:5,]

>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> 1          5.1         3.5          1.4         0.2  setosa
> 2          4.9         3.0          1.4         0.2  setosa
> 3          4.7         3.2          1.3         0.2  setosa
> 4          4.6         3.1          1.5         0.2  setosa
> 5          5.0         3.6          1.4         0.2  setosa

lapply(iris,mean)

> Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
> returning NA

> $Sepal.Length
> [1] 5.843333
> 
> $Sepal.Width
> [1] 3.057333
> 
> $Petal.Length
> [1] 3.758
> 
> $Petal.Width
> [1] 1.199333
> 
> $Species
> [1] NA

lapply(x, function(elt) { elt[,1] }) #Elt is the argument, i.e the list

> $a
> [1] 1 2
> 
> $b
> [1] 1 2 3

You can put an arbitrarily complicated function definition inside lapply(), but if it’s going to be more complicated, it’s probably a better idea to define the function separately.

sapply

The sapply() function behaves similarly to lapply(); the only real difference is in the return value. sapply() will try to simplify the result of lapply() if possible. Essentially, sapply() calls lapply() on its input and then applies the following algorithm:

If the result is a list where every element is length 1, then a vector is returned
If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
If it can’t figure things out, a list is returned

x<-list(a=matrix(1:4, 2,2), b=matrix(1:6,3,2)); x

> $a
>      [,1] [,2]
> [1,]    1    3
> [2,]    2    4
> 
> $b
>      [,1] [,2]
> [1,]    1    4
> [2,]    2    5
> [3,]    3    6

sapply(x, mean)

>   a   b 
> 2.5 3.5

library(datasets)
s <- split(airquality, airquality$Month)
sapply(s, function(x) {colMeans(x[, c("Ozone", "Solar.R", "Wind")],na.rm = TRUE)})

>                 5         6          7          8         9
> Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
> Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
> Wind     11.62258  10.26667   8.941935   8.793548  10.18000

apply

The apply() function is used to a evaluate a function (often an anonymous one) over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix (which is just a 2-dimensional array).

str(apply)

> function (X, MARGIN, FUN, ...)

Por ejemplo:

x<-matrix(rnorm(200),20,10)
apply(x,2,mean) #dimension 2 is cols. El promedio se hace vertical

>  [1]  0.17307975  0.07813620 -0.21273428  0.35048079  0.01849736  0.17894928
>  [7] -0.26329213  0.19919152 -0.16809101  0.03871802

apply(x,1,sum)  #Sum through the rows

>  [1] -4.36190992  5.88912273 -3.74986586 -0.46952941 -2.26029643 -1.72725736
>  [7]  4.79822344 -4.47910605 -0.06277095 -0.80749994  0.88526970  2.36069841
> [13] -2.47805374  2.36070808 -1.20829089  0.74987462  4.37266752  2.53142014
> [19]  4.36210080  1.15320514

Hay funciones como rowsum, o colsum que pueden reducir el codigo.

mapply

The mapply() function is a multivariate apply of sorts which applies a function in parallel over a set of arguments. Recall that lapply() and friends only iterate over a single R object.

str(mapply)

> function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)

FUN is a function to apply
… contains arguments to apply over
MoreArgs is a list of other arguments to FUN
SIMPLIFY indicates whether the result should be simplyfied

Por ejemplo, en vez de usar list(rep(1,4), rep(2,3), rep(3,2), rep(4,1)) se puede usar:

mapply(rep,1:4,4:1)

> [[1]]
> [1] 1 1 1 1
> 
> [[2]]
> [1] 2 2 2
> 
> [[3]]
> [1] 3 3
> 
> [[4]]
> [1] 4

mapply() can be used to vectorize a function:

sumsq <- function(mu, sigma, x) {sum(((x - mu) / sigma)^2)}
x <- rnorm(100) ## Generate some data
sumsq(1:10, 1:10, x) #not the result that we want

> [1] 104.2627

mapply(sumsq, 1:10, 1:10, MoreArgs = list(x = x)) #We can use also vectorize

>  [1] 218.2807 136.8669 119.6283 112.8651 109.4011 107.3393 105.9878 105.0405
>  [9] 104.3429 103.8096

tapply

tapply() is used to apply a function over subsets of a vector. It can be thought of as a combination of split() and sapply() for vectors only. I’ve been told that the “t” in tapply() refers to “table”, but that is unconfirmed.

str(tapply)

> function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)

X is a vector
INDEX is a factor or a list of factors
FUN is a function to apply
… contains arguments to apply over
SIMPLIFY indicates whether the result should be simplyfied

x<-c(rnorm(10), runif(10),rnorm(10))
f<-gl(3,10) #Genera los niveles de los factores gl(n,k) n es el numero de niveles y k el numero de replicas
tapply(x,f,mean)

>          1          2          3 
>  0.2826848  0.3490682 -0.1937033

data("mtcars")
mtcars[1:5,]

>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

tapply(mtcars$mpg,mtcars$cyl,mean)

>        4        6        8 
> 26.66364 19.74286 15.10000

split

split takes a vector or other objetc and splits it into groups determined by a factor of list of factors

str(split)
function(x, f, drop=FALSE,... )

x is a vector or data frame
f is a factores or a list of factors
drop indicates whether empty factors levels should be dropped

x<-c(rnorm(10), runif(10),rnorm(10))
f<-gl(3,10) #Genera los niveles de los factores gl(n,k) n es el numero de niveles y k el numero de replicas
split(x,f) #siempre devuelve una lista. Se puede usar con otra function

> $`1`
>  [1]  0.008569759  0.477261567  1.931180644 -0.936025291 -2.179742864
>  [6]  0.114505524 -1.050862210 -0.096015922 -1.214242667 -0.199772953
> 
> $`2`
>  [1] 0.9795859 0.4812531 0.3831788 0.6900134 0.7154081 0.9284080 0.1054450
>  [8] 0.9040322 0.3522041 0.9960455
> 
> $`3`
>  [1] -1.19798852 -0.86591087 -2.21173945  1.15170942  1.27622749  0.32057739
>  [7]  0.97449109  0.49393589  0.06746323  1.23617451

lapply(split(x,f),mean)

> $`1`
> [1] -0.3145144
> 
> $`2`
> [1] 0.6535574
> 
> $`3`
> [1] 0.124494

Usefull example

library(datasets)
s <- split(airquality, airquality$Month)
sapply(s, function(x) {colMeans(x[, c("Ozone", "Solar.R", "Wind")],na.rm = TRUE)})

>                 5         6          7          8         9
> Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
> Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
> Wind     11.62258  10.26667   8.941935   8.793548  10.18000

Regular Expressions - Texto

Intro

Texto
- paste()
- strcmp()
- tolower()-toUpper()
- strsplit()
grep() grepl()
sapply()
apply()
mapply() vectorize()
tapply()
split()

Texto - General

To read text files use readLines(...txt) * Para convertir vectores a caracteres unidos

paste(as.character(c(rep(0,times=3-length(1)),1)), collapse ="")

> [1] "001"

paste0(as.character(c(rep(0,times=3-length(1)),1)), collapse = "")

> [1] "001"

Para comparar dos strings

library(pracma)
strcmpi("Hola","hola") #case insensitive

> [1] TRUE

strcmp("Hola","hola")  #case sensitive

> [1] FALSE

Para poner todas las letras en minuscula

tolower("HOLA")

> [1] "hola"

Para separar de acuerdo a un indicador

testName <- "HOLA.Como.Estas"
strsplit(testName,"\\.")

> [[1]]
> [1] "HOLA"  "Como"  "Estas"

Para sustituir un caracter

gsub("[[:punct:]]"," ",testName) #removes all puntuaction

> [1] "HOLA Como Estas"

gsub("[.]"," ",testName) #tambien se puede [-]

> [1] "HOLA Como Estas"

Metacharacters:

Metacharacters * ^ para buscar al inicio * $ final * [Bb] permitir las mayusculas y minusculas. * ? Para hasta el primer match. Ver regexpr()

library(datasets)
grep("^New", state.name) #Index

> [1] 29 30 31 32

grep("^New", state.name, value = TRUE) #Values

> [1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"

grep() - grepl()

grep() is used to match a label into a character vector. Returns a vector of the indices of the elements of x that yielded a match. By setting value = TRUE the output is the real value.

> function (pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, 
>     fixed = FALSE, useBytes = FALSE, invert = FALSE)

g<- grep("Cause: shooting", homicides);length(g)

> [1] 228

g<- grep("Cause: [Ss]hooting", homicides);length(g) #capital insensitive

> [1] 1003

The function grepl() works much like grep() except that it differs in its return value. grepl() returns a logical vector indicating which element of a character vector contains the match.

> function (pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, 
>     useBytes = FALSE)

g<-grepl("^New", state.name);g

>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [25] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
> [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [49] FALSE FALSE

state.name[g]

> [1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"

regexpr() - regexec()

> function (pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, 
>     useBytes = FALSE)

The regexpr() function gives you the (a) index into each string where the match begins and the (b) length of the match for that string. regexpr() only gives you the first match of the string (reading left to right). gregexpr() will give you all of the matches in a given string if there are is more than one match.

regexpr("<dd>[F|f]ound(.*)</dd>", homicides[1]); r<-regexpr("<dd>[F|f]ound(.*)</dd>", homicides[1])

> [1] 177
> attr(,"match.length")
> [1] 93
> attr(,"index.type")
> [1] "chars"
> attr(,"useBytes")
> [1] TRUE

regmatches(homicides[1],r)

> [1] "<dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd>"

regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1]); r<-regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1])

> [1] 177
> attr(,"match.length")
> [1] 33
> attr(,"index.type")
> [1] "chars"
> attr(,"useBytes")
> [1] TRUE

regmatches(homicides[1],r)

> [1] "<dd>Found on January 1, 2007</dd>"

The regexec() function works like regexpr() except it gives you the indices for parenthesized subexpressions.

r<-regexec("<dd>[F|f]ound on (.*?)</dd>", homicides[1])
regmatches(homicides[1], r)

> [[1]]
> [1] "<dd>Found on January 1, 2007</dd>" "January 1, 2007"

gsub() - sub()

sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character).

x <- substr(homicides[1], 177, 177 + 33 - 1); x

> [1] "<dd>Found on January 1, 2007</dd>"

sub("<dd>[F|f]ound on |</dd>", "", x)

> [1] "January 1, 2007</dd>"

gsub("<dd>[F|f]ound on |</dd>", "", x)

> [1] "January 1, 2007"

Useful functions

nchar()

nchar Takes a character vector as an argument and returns a vector whose elements contain the sizes of the corresponding elements of x. Its really fast. Returns a data.frame in a vector is introduced.

x <- c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")
nchar(x)

> [1]  5  6  6  1 15

Data Tables - Data Frames

Intro

subset()

subsetting

Creates a subset of a data table

library(datasets)
head(subset(iris, Species=="setosa"))

>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> 1          5.1         3.5          1.4         0.2  setosa
> 2          4.9         3.0          1.4         0.2  setosa
> 3          4.7         3.2          1.3         0.2  setosa
> 4          4.6         3.1          1.5         0.2  setosa
> 5          5.0         3.6          1.4         0.2  setosa
> 6          5.4         3.9          1.7         0.4  setosa

class(subset(iris, Species=="setosa"))

> [1] "data.frame"

Folders and Files

Intro

download.file()
read.table()

Files

Connection with outside world

file, opens a connection to a file
url, opens a connection to url
gzfile and bzfile opens a connection with compresed files
file.exist Check if files exist. Returns TRUE or FALSE
dir.create Create directory if it doesnt exist.
download.file

> function (url, destfile, method, quiet = FALSE, mode = "w", cacheOK = TRUE, 
>     extra = getOption("download.file.extra"), headers = NULL, ...)

Ejemplo

url <-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
destfile <-"Ejercicio_Semana3/US_communities.csv"
download.file(url, destfile)

Reading files

Hay diferentes funciones para leer los datos:

read.table, read.csv Son iguales, la unica difrencia es que read.table el separador por defecto es un espacio y en read.csv es una coma
readLines, para leer lineas de texto
source, Para leer R code files
dget, para leer R code files
load, para cargar los datos
unserialize, para leer en forma binaria
read.table() not so useful for big data.
- If you are trying to read a .csv file you should use sep =",", header=TRUE.
- na.strings set the character that represents a missing value
- skip number of lines to skip before starting to read
- Cuando se van a leer bases de datos muy grandes, se pueden hacer unas optimizaciones:
- Poner comment.char="" si no hay comentarios en el archivo
- Especificar colClasses. Una forma de encontrar las clases de cada columna es:

 initial<- read.table("Nombredelarchivo.txt", nrows=100)  
 classes <- sapply(initial,class)  
 tabAll< read.table("Nombredelarchivo.txt", colClasses=classes)

+ Especificar nrows, se puede sobre estimar  
+ Puede hacer el calculo de la memoria que va a gastar.

reading excel files
- library(xlsx) es un paquete para leer excel. Tambien XLConnect tiene mas opciones
- Se puede usar la funcion read.xlsx. Esta funcion permite leer columnas y filas especificas.
- para crear un archivo write.xlsx
data.table es un paquete
- library(data.table)
- tables() todas las tablas del documento
- Importante Para crear una copia de una tabla se debe hacer uso de la funcion copy

dplyr Package

Intro

Its useful for working with data frames.

select()
filter()
arrange()
rename()
mutate()
group_by()
merge()

Select

> function (.data, ...)

Para ver los datos de columnas

head(select(iris, Sepal.Length:Petal.Length))

>   Sepal.Length Sepal.Width Petal.Length
> 1          5.1         3.5          1.4
> 2          4.9         3.0          1.4
> 3          4.7         3.2          1.3
> 4          4.6         3.1          1.5
> 5          5.0         3.6          1.4
> 6          5.4         3.9          1.7

Para omitir los datos de una columna

head(select(iris, -Sepal.Length))

>   Sepal.Width Petal.Length Petal.Width Species
> 1         3.5          1.4         0.2  setosa
> 2         3.0          1.4         0.2  setosa
> 3         3.2          1.3         0.2  setosa
> 4         3.1          1.5         0.2  setosa
> 5         3.6          1.4         0.2  setosa
> 6         3.9          1.7         0.4  setosa

Filter

Subset rows given a condition.

> function (.data, ..., .preserve = FALSE)

head(filter(iris, Petal.Width>0.2))

>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> 1          5.4         3.9          1.7         0.4  setosa
> 2          4.6         3.4          1.4         0.3  setosa
> 3          5.7         4.4          1.5         0.4  setosa
> 4          5.4         3.9          1.3         0.4  setosa
> 5          5.1         3.5          1.4         0.3  setosa
> 6          5.7         3.8          1.7         0.3  setosa

Arrange

Order the rows

> function (.data, ..., .by_group = FALSE)

head(arrange(iris, Petal.Width)) #se puede usar desc(Petal.Width) para el orden descendiente

>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> 1          4.9         3.1          1.5         0.1  setosa
> 2          4.8         3.0          1.4         0.1  setosa
> 3          4.3         3.0          1.1         0.1  setosa
> 4          5.2         4.1          1.5         0.1  setosa
> 5          4.9         3.6          1.4         0.1  setosa
> 6          5.1         3.5          1.4         0.2  setosa

tail(arrange(iris, Petal.Width))

>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
> 145          5.8         2.8          5.1         2.4 virginica
> 146          6.3         3.4          5.6         2.4 virginica
> 147          6.7         3.1          5.6         2.4 virginica
> 148          6.3         3.3          6.0         2.5 virginica
> 149          7.2         3.6          6.1         2.5 virginica
> 150          6.7         3.3          5.7         2.5 virginica

Rename

Rename cols

> function (.data, ...)

head(rename(iris, Petal_Width=Petal.Width)) #se puede usar desc(Petal.Width) para el orden descendiente

>   Sepal.Length Sepal.Width Petal.Length Petal_Width Species
> 1          5.1         3.5          1.4         0.2  setosa
> 2          4.9         3.0          1.4         0.2  setosa
> 3          4.7         3.2          1.3         0.2  setosa
> 4          4.6         3.1          1.5         0.2  setosa
> 5          5.0         3.6          1.4         0.2  setosa
> 6          5.4         3.9          1.7         0.4  setosa

Mutate

Create or adds new variables

> function (.data, ...)

head(mutate(iris, Petal_std=Petal.Width-mean(Petal.Width))) #se puede usar desc(Petal.Width) para el orden descendiente

>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  Petal_std
> 1          5.1         3.5          1.4         0.2  setosa -0.9993333
> 2          4.9         3.0          1.4         0.2  setosa -0.9993333
> 3          4.7         3.2          1.3         0.2  setosa -0.9993333
> 4          4.6         3.1          1.5         0.2  setosa -0.9993333
> 5          5.0         3.6          1.4         0.2  setosa -0.9993333
> 6          5.4         3.9          1.7         0.4  setosa -0.7993333

Group_by - ddply

Group by categories. Es una especie de tabla dinamica.

> function (.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))

group<-group_by(iris, Species) #se puede usar desc(Petal.Width) para el orden descendiente
summarise(group, Petal_meanWidth = mean(Petal.Width), Max_Sepal_length = max(Sepal.Length))

> `summarise()` ungrouping output (override with `.groups` argument)

> # A tibble: 3 x 3
>   Species    Petal_meanWidth Max_Sepal_length
>   <fct>                <dbl>            <dbl>
> 1 setosa               0.246              5.8
> 2 versicolor           1.33               7  
> 3 virginica            2.03               7.9

ddply() Split Data Frame, Apply Function, And Return Results In A Data Frame.

library(plyr)

> Warning: package 'plyr' was built under R version 4.0.2

> ------------------------------------------------------------------------------

> You have loaded plyr after dplyr - this is likely to cause problems.
> If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
> library(plyr); library(dplyr)

> ------------------------------------------------------------------------------

> 
> Attaching package: 'plyr'

> The following objects are masked from 'package:dplyr':
> 
>     arrange, count, desc, failwith, id, mutate, rename, summarise,
>     summarize

ans <- ddply(iris, .(Species), function(x) {h<-max(x$Sepal.Length);
p<-mean(x$Petal.Width);ans<-cbind(h,p)})
 names(ans)<-c("Species","Petal_meanWidth","Max_Sepal_length"); ans

>      Species Petal_meanWidth Max_Sepal_length
> 1     setosa             5.8            0.246
> 2 versicolor             7.0            1.326
> 3  virginica             7.9            2.026

Merge

Merge with common id. merge, join, join_all

> function (.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))

group<-group_by(iris, Species) #se puede usar desc(Petal.Width) para el orden descendiente
summarise(group, Petal_meanWidth=mean(Petal.Width), Max_Sepal_length=max(Sepal.Length))

>   Petal_meanWidth Max_Sepal_length
> 1        1.199333              7.9

Plotting ang Graphs

Graphs

Intro

Important Save the file as svg. (scalable vector). So it wont lose resolution.

par()
plot() - hist()
boxplot()

Plots

In the base system, first a graph has to be initialized and then it has to be annotated.

To initialize use a function like plot or hist
To annotated par can be used to set or query graphical parameters. par() creates global parameters of the all graphics. Can be overridden. Anotation like adding lines, points, legends, etc.

pch the plotting symbol if its ’*" or ‘.’. 1 is circle
lty line type
lwd line width
col plotting color, number, string or hexcode
xlab and ylab labels
mar: A numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin to be specified on the four sides of the plot. The default is c(5, 4, 4, 2) + 0.1
bg background color
las orientation of the axis
mfrow ormcol to create a subplot: A vector of the form c(nr, nc). Subsequent figures will be drawn in an nr-by-nc array on the device by columns (mfcol), or rows (mfrow), respectively.
oma A vector of the form c(bottom, left, top, right) giving the size of the outer margins in lines of text. default par("oma") is 0 0 0 0
type =“n” makes the data invisible

Base Plotting Functions

lines: add lines to a plot, given a vector of x values and a corresponding vector of y values (or a 2-column matrix); this function just connects the dots
points: add points to a plot
text: add text labels to a plot using specified x, y coordinates
title: add annotations to x, y axis labels, title, subtitle, outer margin
mtext: add arbitrary text to the margins (inner or outer) of the plot
axis: adding axis ticks/labels
Can create scatterplots separated by category

with(data = iris, plot(Petal.Width, Petal.Length, col=Species))
title( main = " Petal Length vs Petal Width ")

with(data = airquality, plot(Wind, Ozone, main = " Ozone and Wind in NY", col="red")) #Title can be added to plot
with(subset(airquality, Month ==5), points(Wind, Ozone, col="blue")) # Added on top
legend("topright", pch=1, col=c("blue", "red"), legend=c("May", "Other Months"))

# Legend=topleft, toprigth, bottomleft, bottomrigth

Subplot

Can create subplots par() can be used to set or query graphical parameters

par(mfcol=c(1,2),mar=c(4,4,2,1))
hist(subset(iris, Species == "setosa")$Sepal.Length, breaks = 10, main="Setosa sepal length", xlab="Length", col="#8BC5CC")
hist(subset(iris, Species == "virginica")$Sepal.Length, breaks = 10, main="Virginica sepal length", xlab="Length", col="#8BCCA2")

par(mfrow = c(2, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(airquality, 
{
 plot(Wind, Ozone, main = "Ozone and Wind")
 plot(Solar.R, Ozone, main = "Ozone and Solar Radiation", pch=2, type="p", col="blue", col.lab="red")
 plot(Temp, Ozone, main = "Ozone and Temperature", pch=5, type="l", col.main='#296A40')
 plot(Wind, Ozone, main = "Ozone and Wind", pch=10, col.axis='purple')
 plot(Solar.R, Ozone, main = "Ozone and Solar Radiation", pch=".", type="h", fg="green") 
 plot(Temp, Ozone, main = "Ozone and Temperature", pch="a", type="b")
 mtext("Ozone and Weather in New York City", outer = TRUE) #outer text
 })

Histogram

hist Computes a histogram of the given data values

> function (x, ...)

breaks=num sets the number of cells in the histogram

hist(iris$Sepal.Length, breaks = 10)

Can add rug() to see all the points.

hist(iris$Sepal.Length, breaks = 10)
rug(iris$Sepal.Length)

SmoothScatter() produces a smoothed color density representation of a scatterplot, obtained through a (2D) kernel density estimate. Like a 2D histogram.

x <- rnorm(10000)
y <- rnorm(10000)
par(mfrow=c(1,3))
plot(x,y, pch =19); smoothScatter(x,y); plot(x,y, col = rgb(0,0,0,0.05), pch =19)

Box plot

boxplot Produce box-and-whisker plot(s) of the given (grouped) values.

airquality <- transform( airquality, Month =factor(Month))
boxplot(Ozone ~ Month, airquality, xlab = "Month", ylab="Ozone ppb")

Lattice system

lattice Everything in one line code. How to interpret next code line: life.exp ~ Income is the same as y vs x, the | represents the condition. i.e I want to see Life.Exp vs Income per month. xyplot

library(lattice)
state<-data.frame(state.x77, region = state.region)
xyplot(Life.Exp ~ Income| region, data= state, layout=c(4,1))

airquality <- transform(airquality, Month = factor(Month))
xyplot(Ozone ~ Wind | Month, data = airquality, layout=c(5,1))

It’s worth to mention that without the transformation there would be no labels corresponding to the months

ggplot2

ggplot2 Mixes elements of Base and Lattice. Uses qplot().
Color in ggplot

str(qplot)

> function (x, y, ..., data, facets = NULL, margins = FALSE, geom = "auto", 
>     xlim = c(NA, NA), ylim = c(NA, NA), log = "", main = NULL, xlab = NULL, 
>     ylab = NULL, asp = NA, stat = NULL, position = NULL)

str(mpg)

> tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
>  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
>  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
>  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
>  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
>  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
>  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
>  $ drv         : chr [1:234] "f" "f" "f" "f" ...
>  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
>  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
>  $ fl          : chr [1:234] "p" "p" "p" "p" ...
>  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

p1 <- qplot(displ, hwy, data=mpg)
p2 <- qplot(displ, hwy, data=mpg, color=year, geom = c("point","smooth"))
grid.arrange(p1, p2, nrow = 1)

> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

qplot using only one argument is an histogram

p1 <- qplot(hwy, data=mpg, fill=drv, main ="Separated by drv")
p2 <- qplot(hwy, data=mpg, fill=manufacturer, main ="Separated by manufacturer")
grid.arrange(p1, p2, nrow = 2)

> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Facets
The variable on the right of facets determines the columns of the panels. i.e each panel corresponds to a variable in drv.

p1 <- qplot(displ, hwy, data=mpg, facets = . ~ drv, main ="Without row definition") 
p2 <- qplot(hwy, data=mpg, facets = drv ~ ., main ="Without col definition", binwidth =2)
grid.arrange(p1, p2, nrow = 1)

Building in layers

air <- datasets::airquality
data <-na.omit(air);
data$Temp <- as.numeric(data$Temp); data$Day <- as.numeric(data$Day);
str(data);

> 'data.frame': 111 obs. of  6 variables:
>  $ Ozone  : int  41 36 12 18 23 19 8 16 11 14 ...
>  $ Solar.R: int  190 118 149 313 299 99 19 256 290 274 ...
>  $ Wind   : num  7.4 8 12.6 11.5 8.6 13.8 20.1 9.7 9.2 10.9 ...
>  $ Temp   : num  67 72 74 62 65 59 61 69 66 68 ...
>  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
>  $ Day    : num  1 2 3 4 7 8 9 12 13 14 ...
>  - attr(*, "na.action")= 'omit' Named int [1:42] 5 6 10 11 25 26 27 32 33 34 ...
>   ..- attr(*, "names")= chr [1:42] "5" "6" "10" "11" ...

g <- ggplot(data, aes(Day, Temp)) #aes defines the mapping. i.e the x and y
g + geom_point(aes(color = Month)) + facet_grid( . ~ Month)+ scale_color_distiller(palette="Blues")

g + geom_point(aes(color = Month), size = 2) + scale_color_distiller(palette="Spectral")

ggplot(data, aes(Day, Temp, group=Month, color=Month)) + geom_line() +geom_point() + scale_color_distiller(palette="Oranges") + theme(plot.title = element_text(hjust = 0.5)) +labs( x="Days", y="Temperature", title = "Temp vs Days")

NOTE To add limits do not add ylim instead coord_cartesian.

Color

To see all the colors that can be called by name use colors().

Using colorRamp() a palette can be created between the given colors.

ramp <- colorRamp(c("red", "blue")) #
ramp(1) #Blue

>      [,1] [,2] [,3]
> [1,]    0    0  255

ramp(0) #Red

>      [,1] [,2] [,3]
> [1,]  255    0    0

ramp(seq(0,1, len = 10))

>            [,1] [,2]      [,3]
>  [1,] 255.00000    0   0.00000
>  [2,] 226.66667    0  28.33333
>  [3,] 198.33333    0  56.66667
>  [4,] 170.00000    0  85.00000
>  [5,] 141.66667    0 113.33333
>  [6,] 113.33333    0 141.66667
>  [7,]  85.00000    0 170.00000
>  [8,]  56.66667    0 198.33333
>  [9,]  28.33333    0 226.66667
> [10,]   0.00000    0 255.00000

Using colorRampPalette() a hexadecimal system for colors can be created

pal <- colorRampPalette(c("red", "blue")) #
pal(1) #Blue

> [1] "#FF0000"

pal(10) #Red

>  [1] "#FF0000" "#E2001C" "#C60038" "#AA0055" "#8D0071" "#71008D" "#5500AA"
>  [8] "#3800C6" "#1C00E2" "#0000FF"

par(mfrow = c(2, 1))
hist(1:10, col=pal(10)); hist(1:10, col=pal(5))

Clustering

Intro

hclust()
kmeans()

Hierarchical clustering

hclust allows to create a hierarchical clustering. The idea is that the points with smallest distant among each other are merged together.

set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each =4), sd=0.2)
y <- rnorm(12, mean = rep(c(1,2,1), each =4), sd=0.2)
plot(x,y, col ='blue', pch =19); text(x+0.05,y+0.05, labels = as.character(1:12))

dataFrame <- data.frame(x=x,y=y); dataFrame

>            x         y
> 1  0.7585869 0.8447492
> 2  1.0554858 1.0128918
> 3  1.2168882 1.1918988
> 4  0.5308605 0.9779429
> 5  2.0858249 1.8977981
> 6  2.1012112 1.8177609
> 7  1.8850520 1.8325657
> 8  1.8906736 2.4831670
> 9  2.8871096 1.0268176
> 10 2.8219924 0.9018628
> 11 2.9045615 0.9118904
> 12 2.8003227 1.0919179

distxy <-dist(dataFrame); distxy # Pairwise distance, default Euclidean distance

>             1          2          3          4          5          6          7
> 2  0.34120511                                                                  
> 3  0.57493739 0.24102750                                                       
> 4  0.26381786 0.52578819 0.71861759                                            
> 5  1.69424700 1.35818182 1.11952883 1.80666768                                 
> 6  1.65812902 1.31960442 1.08338841 1.78081321 0.08150268                      
> 7  1.49823399 1.16620981 0.92568723 1.60131659 0.21110433 0.21666557           
> 8  1.99149025 1.69093111 1.45648906 2.02849490 0.61704200 0.69791931 0.65062566
> 9  2.13629539 1.83167669 1.67835968 2.35675598 1.18349654 1.11500116 1.28582631
> 10 2.06419586 1.76999236 1.63109790 2.29239480 1.23847877 1.16550201 1.32063059
> 11 2.14702468 1.85183204 1.71074417 2.37461984 1.28153948 1.21077373 1.37369662
> 12 2.05664233 1.74662555 1.58658782 2.27232243 1.07700974 1.00777231 1.17740375
>             8          9         10         11
> 2                                             
> 3                                             
> 4                                             
> 5                                             
> 6                                             
> 7                                             
> 8                                             
> 9  1.76460709                                 
> 10 1.83517785 0.14090406                      
> 11 1.86999431 0.11624471 0.08317570           
> 12 1.66223814 0.10848966 0.19128645 0.20802789

hClustering <- hclust(distxy); plot(hClustering)

To create pretty dendograms:

myplclust(hClustering, lab = hClustering$order, lab.col = rep(1:3, each =4))

K-means

kmeans Perform k-means clustering on a data matrix. kmeans other source

kmeansobj <- kmeans(dataFrame, centers = 3, iter.max = 10)
kmeansobj$cluster #for each data, which cluster is in.

>  [1] 3 1 1 3 2 2 2 2 2 2 2 2

plot(x,y, col= kmeansobj$cluster, pch = 19); points(kmeansobj$centers, col=1:3, pch=3)

Statistics

Simulation

Intro

rnorm()
dnorm()
pnorm()
rnorm()
sample()

Distributions

There are different functions for probability distribution in R:

rnorm generates random variable Normal variables with a given mean and standard deviation
dnorm evaluate the normal probability density with a given mean or sd at a point
pnorm evaluate the cumulative distribution function for a norma distribution
rpois generate random poisson variates with a given rate

Hay diferentes functiones o letras asociadas: p: distribucion cumulativa, q: funcion cuantil, r: generador de numeros aleatorios, d: densidad. En orden: rnorm, dnorm

> function (n, mean = 0, sd = 1)

> function (x, mean = 0, sd = 1, log = FALSE)

> function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

> function (n, lambda)

Para reajustar o reestablecer el generador de numeros se utiliza set.seed(1).

Random Sampling

The function sample draws ramdomly from a specified set of sclara objects allowing to sample from arbitry distributions.

str(sample) # con repeat se puede obtener numeros repetidos

> function (x, size, replace = FALSE, prob = NULL)

set.seed(1)
sample(1:10,4)

> [1] 9 4 7 1

Funciones creadas

Text-Telegram

Returns a text with START at the begining and STOP at the end

telegram <- function(...){paste("START", ..., "STOP")}

Regression Models

Intro

Functions
- lm()
- melt()
- lm()
- lm()

Least Squares

In the 1880’s, Francis Galton was developing ways to quantify the heritability of traits. As part of this work, he collected data on the heights of adult children and their parents. We use melt to reshape the table

library(reshape); 
data(galton); 
str(galton)

> 'data.frame': 928 obs. of  2 variables:
>  $ child : num  61.7 61.7 61.7 61.7 61.7 62.2 62.2 62.2 62.2 62.2 ...
>  $ parent: num  70.5 68.5 65.5 64.5 64 67.5 67.5 67.5 66.5 66.5 ...

head(galton)

>   child parent
> 1  61.7   70.5
> 2  61.7   68.5
> 3  61.7   65.5
> 4  61.7   64.5
> 5  61.7   64.0
> 6  62.2   67.5

long <- melt(galton)

> Using  as id variables

head(long)

>   variable value
> 1    child  61.7
> 2    child  61.7
> 3    child  61.7
> 4    child  61.7
> 5    child  61.7
> 6    child  62.2

g <- ggplot(long, aes(x = value, fill = variable)) 
g <- g + geom_histogram(colour = "black", binwidth=1) 
g <- g + facet_grid(. ~ variable) + labs("Histogram of childs and parents heigths")
g

The command for simple linear regression is lm(response variable~explanatory variable). lm() is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance. Its the same calculating the betas or using lm().

When fitting a linear model y ~ x - 1 specifies a line through the origin. This is useful in some cases

y <- galton$child
x <- galton$parent
beta1 <- cor(y, x) *  sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
rbind(c(beta0, beta1), coef(lm(y ~ x))) #Same using both procedures

>      (Intercept)         x
> [1,]    23.94153 0.6462906
> [2,]    23.94153 0.6462906

Si se normalizan los datos, entonces, la pendiente es la correlacion. La correlacion de x y y es la misma que de los valores normalizados y de la ecuacion de la recta que pasa por los ultimos. Pero el coeficiente no es el mismo que cuando pasa por datos no normalizados.

yn <- (y - mean(y))/sd(y)
xn <- (x - mean(x))/sd(x)
c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2])

>                            xn 
> 0.4587624 0.4587624 0.4587624

Add regression line to plot

Exercise

Consider the data set given below. Give the value of mu that minimazes the least squares equation

x <- c(0.18, -1.54, 0.42, 0.95)
w <- c(2, 1, 3, 1)

mean(w*x)/mean(w)

> [1] 0.1471429

Consider the data set. Fit the regression through the origin and get the slope treating y as the outcome and x as the regressor. (Hint, do not center the data since we want regression through the origin, not through the means of the data.)

x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)

lm(formula = y ~ x - 1)

> 
> Call:
> lm(formula = y ~ x - 1)
> 
> Coefficients:
>      x  
> 0.8263

summary(lm(y~x-1))

> 
> Call:
> lm(formula = y ~ x - 1)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -2.0692 -0.2536  0.5303  0.8592  1.1286 
> 
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> x   0.8263     0.5817   1.421    0.189
> 
> Residual standard error: 1.094 on 9 degrees of freedom
> Multiple R-squared:  0.1831,  Adjusted R-squared:  0.09238 
> F-statistic: 2.018 on 1 and 9 DF,  p-value: 0.1892

Do data(mtcars) from the datasets package and fit the regression model with mpg as the outcome and weight as the predictor. Give the slope coefficient.

data(mtcars)
head(mtcars)

>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

x <- mtcars$wt
y <- mtcars$mpg

lm(y ~ x)

> 
> Call:
> lm(formula = y ~ x)
> 
> Coefficients:
> (Intercept)            x  
>      37.285       -5.344

summary(lm(mpg ~ wt, mtcars)) #anothes form to write this

> 
> Call:
> lm(formula = mpg ~ wt, data = mtcars)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -4.5432 -2.3647 -0.1252  1.4096  6.8727 
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)    
> (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
> wt           -5.3445     0.5591  -9.559 1.29e-10 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 3.046 on 30 degrees of freedom
> Multiple R-squared:  0.7528,  Adjusted R-squared:  0.7446 
> F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

What is the value of the first measurement if x were normalized (to have mean 0 and variance 1)?

x <- c(8.58, 10.46, 9.01, 9.64, 8.86)
xn <- (x-mean(x))/sd(x)
xn

> [1] -0.9718658  1.5310215 -0.3993969  0.4393366 -0.5990954

What value minimizes the sum of the squared distances between these points and itself? For least squares the mean minimizes.

x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
mean(x)

> [1] 0.573

x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)

lm(y~x)

> 
> Call:
> lm(formula = y ~ x)
> 
> Coefficients:
> (Intercept)            x  
>       1.567       -1.713

Class_Notes

Alejandra Barrera

6/21/2020

Data Science - Coursera

Generalities

Vectores, Matrices, listas, factores, dataframe…

Intro

Vectors

Lists

Matrices

Factors

Data Frame

Missing values

Subsetting

Intro

%in%

Vector

Matrix

Lists

Estructuras de control

Intro

If Else

for

While

Funciones

Intro

Forma general

Funciones Anonimas

Sequences

Funciones logicas

Index

Looping functions

Intro

lapply

sapply

apply

mapply

tapply

split

Regular Expressions - Texto

Intro

Texto - General

Metacharacters:

grep() - grepl()

regexpr() - regexec()

gsub() - sub()

Useful functions

nchar()

Data Tables - Data Frames

Intro

subsetting

Folders and Files

Intro

Files

Reading files

dplyr Package

Intro

Select

Filter

Arrange

Rename

Mutate

Group_by - ddply

Merge

Plotting ang Graphs

Graphs

Intro

Plots

Base Plotting Functions

Subplot

Histogram

Box plot

Lattice system

ggplot2

Color

Clustering

Intro

Hierarchical clustering

K-means

Statistics

`%in%`