Notas para la futura Alejandra. Recuerda SIEMPRE estos pasos: Guardar->Commit->Push (asi se hayan desaparecido los archivos) y ya. Puedes usar esta pagina donde estan las referencias de R Markdown. Se puede usar Esta pagina para cosas utiles.
Markdown Sheets
Markdown Info
R Cheat Sheets
Para borrar la consola cntrl+L y para borrar todas las variables rm(list=ls()). Para tener informacion de una funcion se pone “?funcion”
There are different types of classes of objetcs. R has five basic or atomic classes of objects:
L something like 5L)R objects can have attributes which are like metadata. Attributes can be accessed using attributes()
Para hacer Debugging:
c()vector()names()class()dimnames()list()factors()levels()is.na()/ is.nan()Vectors can have only one class of objects, when different objects are mixed in a vector, coercion occurs so that every element in the vector is of the same class. To create a vector:
x<- c(0.5, 0.6) #Como si c significara concatenar
print(x)
> [1] 0.5 0.6
names(x)<-c("Food","Bar")
x
> Food Bar
> 0.5 0.6
Pueden haber distintos objetos (numericos, logicos, enteros, etc)
x<-vector("numeric", length=10)
x
> [1] 0 0 0 0 0 0 0 0 0 0
Also, different modes can be used such as “logical”, “integer”, “numeric” (synonym “double”), “complex”, “character” and “raw”.
*** Para saber la clase entonces se usa class()
class(x)
> [1] "numeric"
x<-0:6
class(x)
> [1] "integer"
as.numeric(x)
> [1] 0 1 2 3 4 5 6
as.character(x)
> [1] "0" "1" "2" "3" "4" "5" "6"
as.logical(x)
> [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
my_vector <-1:20
dim(my_vector)
> NULL
dim(my_vector) <- c(4,5) #rows and cols
my_vector # now is a matrix
> [,1] [,2] [,3] [,4] [,5]
> [1,] 1 5 9 13 17
> [2,] 2 6 10 14 18
> [3,] 3 7 11 15 19
> [4,] 4 8 12 16 20
Are a special type of vectors that contain elements of different classes. To create one:
list()x<-list(1, "a", TRUE, 1+4i) #Recuerda que los numeros que cambian son los indices de los vectores, tienen dobles corchetes cuadrados
print(x)
> [[1]]
> [1] 1
>
> [[2]]
> [1] "a"
>
> [[3]]
> [1] TRUE
>
> [[4]]
> [1] 1+4i
vector()x <- vector("list", length = 5)
Tambien pueden tener nombres
x<-list(a=1,b= "a",c= TRUE,d= 1+4i)
x
> $a
> [1] 1
>
> $b
> [1] "a"
>
> $c
> [1] TRUE
>
> $d
> [1] 1+4i
Para eliminar una lista se puede usar unlist()
x$b
> [1] "a"
Matrices are vectores with a dimmension attribute. Se construyen column wise es decir, se llenan las filas de arriba para abajo una columna a la vez.
m<- matrix(1:6, nrow=2, ncol=3) # Se construye column wise
m
> [,1] [,2] [,3]
> [1,] 1 3 5
> [2,] 2 4 6
dim(m)
> [1] 2 3
attributes(m) #Todo lo que se le pueda sacar
> $dim
> [1] 2 3
m[1,] #Para extraer solo la primera fila
> [1] 1 3 5
rownames() or colnames() can be useddimnames(m) <- list(c("fila1","fila2"),c("col1","col2","col3")) #Se usa una listaaa
m
> col1 col2 col3
> fila1 1 3 5
> fila2 2 4 6
m<-1:10
dim(m) <- c(2,5) #dos filas, dos columnas
m
> [,1] [,2] [,3] [,4] [,5]
> [1,] 1 3 5 7 9
> [2,] 2 4 6 8 10
x<-1:3
y<-10:12
cbind(x,y) # Los une por columnas
> x y
> [1,] 1 10
> [2,] 2 11
> [3,] 3 12
print("Tambien se puede unir por filas")
> [1] "Tambien se puede unir por filas"
rbind(x,y)
> [,1] [,2] [,3]
> x 1 2 3
> y 10 11 12
One can think of a factor as an integer vector where each integer has a label. Factors are automatically created when read.table() is used. Para tener datos categoricos:
x<-factor(c("yes","no","yes", "no"), levels = c("yes","no")) #Para determinar el orden de los niveles.
x
> [1] yes no yes no
> Levels: yes no
table(x) #Conteo
> x
> yes no
> 2 2
class(x)
> [1] "factor"
Are used to stored tabular data in R. dplyr is a package optimized for data frames. Are represented as a special type of list where every element of the list has to have the same length. Can store different classes. Generally are read using functions such as read.table() or read.csv()
x<-data.frame(food=1:4, bar=c(T,T,F,F))
x
> food bar
> 1 1 TRUE
> 2 2 TRUE
> 3 3 FALSE
> 4 4 FALSE
nrow(x)
> [1] 4
ncol(x)
> [1] 2
names() can be used tu columnscolnames(x)[c(1,2)]<-c("Name","Mortality"); x
> Name Mortality
> 1 1 TRUE
> 2 2 TRUE
> 3 3 FALSE
> 4 4 FALSE
x<-data.frame(food=1:4, bar=c(T,T,F,F))
names(x) <- c("Name","Mortality")
To set the names in the rows use row.names()
Missing values are denoted by NA or NaN for undefined mathematical operations.
is.na() is used to test objects if they are NAis.nan() is used to test for NaNThe [ operator always returns an object of the same class as the original. It can be used to select multiple elements of an object
The [[ operator is used to extract elements of a list or a data frame. It can only be used to extract a single element and the class of the returned object will not necessarily be a list or data frame.
The $ operator is used to extract elements of a list or data frame by literal name. Its semantics are similar to that of [[.
Functions
c()vector()names()class()dimnames()list()factors()levels()is.na()/ is.nan()identical()%in%v1 <- 3
v2 <- 101
t <- c(1,2,3,4,5,6,7,8)
v1 %in% t
> [1] TRUE
t %in% v1 #Its better big first.
> [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
x <- c("a", "b", "c", "c", "d", "a")
x[1] ## Extract the first element
> [1] "a"
x[c(1, 3, 4)] # The sequences doesnt has to be in order
> [1] "a" "c" "c"
x[-c(1, 3, 4)] # All elements but those
> [1] "b" "d" "a"
We can also pass a logical sequence to the [ operator to extract elements of a vector that satisfy a given condition. For example, here we want the elements of x that come lexicographically after the letter “a”.
x[x > "a"]
> [1] "b" "c" "c" "d"
identical(x,c(1,2))
> [1] FALSE
x <- matrix(1:6, 2, 3)
x[1, 2]
> [1] 3
x[1, , drop = FALSE] #if the answer has to be a matrix
> [,1] [,2] [,3]
> [1,] 1 3 5
x <- list(foo=1:4, bar=0.6) #lista
x[1] #Lista
> $foo
> [1] 1 2 3 4
x[[1]]
> [1] 1 2 3 4
class(x[2])
> [1] "list"
class(x[[2]])
> [1] "numeric"
x$foo
> [1] 1 2 3 4
x[[1]][[2]] #extraer un elemento de la lista
> [1] 2
x <- list(a = list(10, 12, 14), b = c(3.14, 2.81)); x
> $a
> $a[[1]]
> [1] 10
>
> $a[[2]]
> [1] 12
>
> $a[[3]]
> [1] 14
>
>
> $b
> [1] 3.14 2.81
## Get the 3rd element of the 1st element
x[[c(1, 3)]] #or x[[1]][[3]]
> [1] 14
condition<-TRUE
if(condition==TRUE) {
## do something
print(TRUE)
} else #else if
{ print(FALSE)
## do something else
}
> [1] TRUE
Se puede usar next para cambiar de iteracion o para salir.
for(i in 1:10){
print(i)
}
> [1] 1
> [1] 2
> [1] 3
> [1] 4
> [1] 5
> [1] 6
> [1] 7
> [1] 8
> [1] 9
> [1] 10
## otra forma de escribirlo
for(i in 1:4) print(i)
> [1] 1
> [1] 2
> [1] 3
> [1] 4
## Para crear una sequencia
seq(5)
> [1] 1 2 3 4 5
Se puede reemplazar con un repeat pero hay que usar un break
count<-0
while(count<10) {
print(count)
count<-count+1
}
> [1] 0
> [1] 1
> [1] 2
> [1] 3
> [1] 4
> [1] 5
> [1] 6
> [1] 7
> [1] 8
> [1] 9
To see the source code , just type the function without parenthesis. To see the arguments type args(function)
seq()seq_along()isTRUE()xor()all()any()which()above_10 <- function (x, n=10){
use <- x>n
x[use]
# Se puede especificar default values.
}
above_10(c(1,4,6,8,99))
> [1] 99
evaluate <- function(func, dat){func(dat)}
evaluate(function(x){x[1]}, c(8, 4, 0)) #return the first element of the vector
> [1] 8
evaluate(function(x){x[length(x)]}, c(8, 4, 0)) #return the last element of the vector
> [1] 0
How to create a sequence
seq(1,10, by=2)
> [1] 1 3 5 7 9
seq(1,10,length=3)
> [1] 1.0 5.5 10.0
seq(along=c(1,3,8,25,100)) #create a vector with the length of the vector
> [1] 1 2 3 4 5
seq_along(c(1,3,8,25,100)) #Same
> [1] 1 2 3 4 5
The & is first evaluated than |. && or || evaluates only the first argument. * isTRUE()
isTRUE(3>2)
> [1] TRUE
xor() function stands for exclusive OR. If one argument evaluates to TRUE and one argument evaluates to FALSE, then this function will return TRUE, otherwise it will return FALSE.xor(5==6, !FALSE)
> [1] TRUE
xor(TRUE, TRUE)
> [1] FALSE
ints <- sample(10)
any(ints<0)
> [1] FALSE
all(ints>0)
> [1] TRUE
which()ints <-sample(10)
which(ints>5)
> [1] 1 4 5 6 9
lapply()sapply()apply()mapply() vectorize()tapply()split()The lapply() function does the following simple series of operations:
1. it loops over a list, iterating over each element in that list
2. it applies a function to each element of the list (a function that you specify)
3. and returns a list (the l is for “list”).
x<-list(a=matrix(1:4, 2,2), b=matrix(1:6,3,2)); x
> $a
> [,1] [,2]
> [1,] 1 3
> [2,] 2 4
>
> $b
> [,1] [,2]
> [1,] 1 4
> [2,] 2 5
> [3,] 3 6
lapply(x, mean)
> $a
> [1] 2.5
>
> $b
> [1] 3.5
When you pass a function to lapply(), lapply() takes elements of the list and passes them as the first argument of the function you are applying.
Si se tiene una lista, se va a aplicar la funcion a toda la lista.
library(datasets)
data(iris)
iris[1:5,]
> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> 1 5.1 3.5 1.4 0.2 setosa
> 2 4.9 3.0 1.4 0.2 setosa
> 3 4.7 3.2 1.3 0.2 setosa
> 4 4.6 3.1 1.5 0.2 setosa
> 5 5.0 3.6 1.4 0.2 setosa
lapply(iris,mean)
> Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
> returning NA
> $Sepal.Length
> [1] 5.843333
>
> $Sepal.Width
> [1] 3.057333
>
> $Petal.Length
> [1] 3.758
>
> $Petal.Width
> [1] 1.199333
>
> $Species
> [1] NA
lapply(x, function(elt) { elt[,1] }) #Elt is the argument, i.e the list
> $a
> [1] 1 2
>
> $b
> [1] 1 2 3
You can put an arbitrarily complicated function definition inside lapply(), but if it’s going to be more complicated, it’s probably a better idea to define the function separately.
The sapply() function behaves similarly to lapply(); the only real difference is in the return value. sapply() will try to simplify the result of lapply() if possible. Essentially, sapply() calls lapply() on its input and then applies the following algorithm:
x<-list(a=matrix(1:4, 2,2), b=matrix(1:6,3,2)); x
> $a
> [,1] [,2]
> [1,] 1 3
> [2,] 2 4
>
> $b
> [,1] [,2]
> [1,] 1 4
> [2,] 2 5
> [3,] 3 6
sapply(x, mean)
> a b
> 2.5 3.5
library(datasets)
s <- split(airquality, airquality$Month)
sapply(s, function(x) {colMeans(x[, c("Ozone", "Solar.R", "Wind")],na.rm = TRUE)})
> 5 6 7 8 9
> Ozone 23.61538 29.44444 59.115385 59.961538 31.44828
> Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
> Wind 11.62258 10.26667 8.941935 8.793548 10.18000
The apply() function is used to a evaluate a function (often an anonymous one) over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix (which is just a 2-dimensional array).
str(apply)
> function (X, MARGIN, FUN, ...)
Por ejemplo:
x<-matrix(rnorm(200),20,10)
apply(x,2,mean) #dimension 2 is cols. El promedio se hace vertical
> [1] 0.17307975 0.07813620 -0.21273428 0.35048079 0.01849736 0.17894928
> [7] -0.26329213 0.19919152 -0.16809101 0.03871802
apply(x,1,sum) #Sum through the rows
> [1] -4.36190992 5.88912273 -3.74986586 -0.46952941 -2.26029643 -1.72725736
> [7] 4.79822344 -4.47910605 -0.06277095 -0.80749994 0.88526970 2.36069841
> [13] -2.47805374 2.36070808 -1.20829089 0.74987462 4.37266752 2.53142014
> [19] 4.36210080 1.15320514
Hay funciones como rowsum, o colsum que pueden reducir el codigo.
The mapply() function is a multivariate apply of sorts which applies a function in parallel over a set of arguments. Recall that lapply() and friends only iterate over a single R object.
str(mapply)
> function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
Por ejemplo, en vez de usar list(rep(1,4), rep(2,3), rep(3,2), rep(4,1)) se puede usar:
mapply(rep,1:4,4:1)
> [[1]]
> [1] 1 1 1 1
>
> [[2]]
> [1] 2 2 2
>
> [[3]]
> [1] 3 3
>
> [[4]]
> [1] 4
mapply() can be used to vectorize a function:sumsq <- function(mu, sigma, x) {sum(((x - mu) / sigma)^2)}
x <- rnorm(100) ## Generate some data
sumsq(1:10, 1:10, x) #not the result that we want
> [1] 104.2627
mapply(sumsq, 1:10, 1:10, MoreArgs = list(x = x)) #We can use also vectorize
> [1] 218.2807 136.8669 119.6283 112.8651 109.4011 107.3393 105.9878 105.0405
> [9] 104.3429 103.8096
tapply() is used to apply a function over subsets of a vector. It can be thought of as a combination of split() and sapply() for vectors only. I’ve been told that the “t” in tapply() refers to “table”, but that is unconfirmed.
str(tapply)
> function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
x<-c(rnorm(10), runif(10),rnorm(10))
f<-gl(3,10) #Genera los niveles de los factores gl(n,k) n es el numero de niveles y k el numero de replicas
tapply(x,f,mean)
> 1 2 3
> 0.2826848 0.3490682 -0.1937033
data("mtcars")
mtcars[1:5,]
> mpg cyl disp hp drat wt qsec vs am gear carb
> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
tapply(mtcars$mpg,mtcars$cyl,mean)
> 4 6 8
> 26.66364 19.74286 15.10000
split takes a vector or other objetc and splits it into groups determined by a factor of list of factors
str(split)
function(x, f, drop=FALSE,... )
x<-c(rnorm(10), runif(10),rnorm(10))
f<-gl(3,10) #Genera los niveles de los factores gl(n,k) n es el numero de niveles y k el numero de replicas
split(x,f) #siempre devuelve una lista. Se puede usar con otra function
> $`1`
> [1] 0.008569759 0.477261567 1.931180644 -0.936025291 -2.179742864
> [6] 0.114505524 -1.050862210 -0.096015922 -1.214242667 -0.199772953
>
> $`2`
> [1] 0.9795859 0.4812531 0.3831788 0.6900134 0.7154081 0.9284080 0.1054450
> [8] 0.9040322 0.3522041 0.9960455
>
> $`3`
> [1] -1.19798852 -0.86591087 -2.21173945 1.15170942 1.27622749 0.32057739
> [7] 0.97449109 0.49393589 0.06746323 1.23617451
lapply(split(x,f),mean)
> $`1`
> [1] -0.3145144
>
> $`2`
> [1] 0.6535574
>
> $`3`
> [1] 0.124494
library(datasets)
s <- split(airquality, airquality$Month)
sapply(s, function(x) {colMeans(x[, c("Ozone", "Solar.R", "Wind")],na.rm = TRUE)})
> 5 6 7 8 9
> Ozone 23.61538 29.44444 59.115385 59.961538 31.44828
> Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
> Wind 11.62258 10.26667 8.941935 8.793548 10.18000
paste()strcmp()tolower()-toUpper()strsplit()grep() grepl()sapply()apply()mapply() vectorize()tapply()split()To read text files use readLines(...txt) * Para convertir vectores a caracteres unidos
paste(as.character(c(rep(0,times=3-length(1)),1)), collapse ="")
> [1] "001"
paste0(as.character(c(rep(0,times=3-length(1)),1)), collapse = "")
> [1] "001"
library(pracma)
strcmpi("Hola","hola") #case insensitive
> [1] TRUE
strcmp("Hola","hola") #case sensitive
> [1] FALSE
tolower("HOLA")
> [1] "hola"
testName <- "HOLA.Como.Estas"
strsplit(testName,"\\.")
> [[1]]
> [1] "HOLA" "Como" "Estas"
gsub("[[:punct:]]"," ",testName) #removes all puntuaction
> [1] "HOLA Como Estas"
gsub("[.]"," ",testName) #tambien se puede [-]
> [1] "HOLA Como Estas"
Metacharacters * ^ para buscar al inicio * $ final * [Bb] permitir las mayusculas y minusculas. * ? Para hasta el primer match. Ver regexpr()
library(datasets)
grep("^New", state.name) #Index
> [1] 29 30 31 32
grep("^New", state.name, value = TRUE) #Values
> [1] "New Hampshire" "New Jersey" "New Mexico" "New York"
grep() is used to match a label into a character vector. Returns a vector of the indices of the elements of x that yielded a match. By setting value = TRUE the output is the real value.
> function (pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
> fixed = FALSE, useBytes = FALSE, invert = FALSE)
g<- grep("Cause: shooting", homicides);length(g)
> [1] 228
g<- grep("Cause: [Ss]hooting", homicides);length(g) #capital insensitive
> [1] 1003
The function grepl() works much like grep() except that it differs in its return value. grepl() returns a logical vector indicating which element of a character vector contains the match.
> function (pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE,
> useBytes = FALSE)
g<-grepl("^New", state.name);g
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [25] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
> [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> [49] FALSE FALSE
state.name[g]
> [1] "New Hampshire" "New Jersey" "New Mexico" "New York"
> function (pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE,
> useBytes = FALSE)
The regexpr() function gives you the (a) index into each string where the match begins and the (b) length of the match for that string. regexpr() only gives you the first match of the string (reading left to right). gregexpr() will give you all of the matches in a given string if there are is more than one match.
regexpr("<dd>[F|f]ound(.*)</dd>", homicides[1]); r<-regexpr("<dd>[F|f]ound(.*)</dd>", homicides[1])
> [1] 177
> attr(,"match.length")
> [1] 93
> attr(,"index.type")
> [1] "chars"
> attr(,"useBytes")
> [1] TRUE
regmatches(homicides[1],r)
> [1] "<dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd>"
regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1]); r<-regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1])
> [1] 177
> attr(,"match.length")
> [1] 33
> attr(,"index.type")
> [1] "chars"
> attr(,"useBytes")
> [1] TRUE
regmatches(homicides[1],r)
> [1] "<dd>Found on January 1, 2007</dd>"
The regexec() function works like regexpr() except it gives you the indices for parenthesized subexpressions.
r<-regexec("<dd>[F|f]ound on (.*?)</dd>", homicides[1])
regmatches(homicides[1], r)
> [[1]]
> [1] "<dd>Found on January 1, 2007</dd>" "January 1, 2007"
sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character).
x <- substr(homicides[1], 177, 177 + 33 - 1); x
> [1] "<dd>Found on January 1, 2007</dd>"
sub("<dd>[F|f]ound on |</dd>", "", x)
> [1] "January 1, 2007</dd>"
gsub("<dd>[F|f]ound on |</dd>", "", x)
> [1] "January 1, 2007"
nchar Takes a character vector as an argument and returns a vector whose elements contain the sizes of the corresponding elements of x. Its really fast. Returns a data.frame in a vector is introduced.
x <- c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")
nchar(x)
> [1] 5 6 6 1 15
subset()Creates a subset of a data table
library(datasets)
head(subset(iris, Species=="setosa"))
> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> 1 5.1 3.5 1.4 0.2 setosa
> 2 4.9 3.0 1.4 0.2 setosa
> 3 4.7 3.2 1.3 0.2 setosa
> 4 4.6 3.1 1.5 0.2 setosa
> 5 5.0 3.6 1.4 0.2 setosa
> 6 5.4 3.9 1.7 0.4 setosa
class(subset(iris, Species=="setosa"))
> [1] "data.frame"
download.file()read.table()Connection with outside world
file.exist Check if files exist. Returns TRUE or FALSEdir.create Create directory if it doesnt exist.download.file> function (url, destfile, method, quiet = FALSE, mode = "w", cacheOK = TRUE,
> extra = getOption("download.file.extra"), headers = NULL, ...)
Ejemplo
url <-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
destfile <-"Ejercicio_Semana3/US_communities.csv"
download.file(url, destfile)
Hay diferentes funciones para leer los datos:
read.table() not so useful for big data.
sep =",", header=TRUE.na.strings set the character that represents a missing value initial<- read.table("Nombredelarchivo.txt", nrows=100)
classes <- sapply(initial,class)
tabAll< read.table("Nombredelarchivo.txt", colClasses=classes)
+ Especificar nrows, se puede sobre estimar
+ Puede hacer el calculo de la memoria que va a gastar.
library(xlsx) es un paquete para leer excel. Tambien XLConnect tiene mas opcionesread.xlsx. Esta funcion permite leer columnas y filas especificas.write.xlsxdata.table es un paquete
tables() todas las tablas del documentocopyIts useful for working with data frames.
select()filter()arrange()rename()mutate()group_by()merge()> function (.data, ...)
head(select(iris, Sepal.Length:Petal.Length))
> Sepal.Length Sepal.Width Petal.Length
> 1 5.1 3.5 1.4
> 2 4.9 3.0 1.4
> 3 4.7 3.2 1.3
> 4 4.6 3.1 1.5
> 5 5.0 3.6 1.4
> 6 5.4 3.9 1.7
head(select(iris, -Sepal.Length))
> Sepal.Width Petal.Length Petal.Width Species
> 1 3.5 1.4 0.2 setosa
> 2 3.0 1.4 0.2 setosa
> 3 3.2 1.3 0.2 setosa
> 4 3.1 1.5 0.2 setosa
> 5 3.6 1.4 0.2 setosa
> 6 3.9 1.7 0.4 setosa
Subset rows given a condition.
> function (.data, ..., .preserve = FALSE)
head(filter(iris, Petal.Width>0.2))
> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> 1 5.4 3.9 1.7 0.4 setosa
> 2 4.6 3.4 1.4 0.3 setosa
> 3 5.7 4.4 1.5 0.4 setosa
> 4 5.4 3.9 1.3 0.4 setosa
> 5 5.1 3.5 1.4 0.3 setosa
> 6 5.7 3.8 1.7 0.3 setosa
Order the rows
> function (.data, ..., .by_group = FALSE)
head(arrange(iris, Petal.Width)) #se puede usar desc(Petal.Width) para el orden descendiente
> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> 1 4.9 3.1 1.5 0.1 setosa
> 2 4.8 3.0 1.4 0.1 setosa
> 3 4.3 3.0 1.1 0.1 setosa
> 4 5.2 4.1 1.5 0.1 setosa
> 5 4.9 3.6 1.4 0.1 setosa
> 6 5.1 3.5 1.4 0.2 setosa
tail(arrange(iris, Petal.Width))
> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> 145 5.8 2.8 5.1 2.4 virginica
> 146 6.3 3.4 5.6 2.4 virginica
> 147 6.7 3.1 5.6 2.4 virginica
> 148 6.3 3.3 6.0 2.5 virginica
> 149 7.2 3.6 6.1 2.5 virginica
> 150 6.7 3.3 5.7 2.5 virginica
Rename cols
> function (.data, ...)
head(rename(iris, Petal_Width=Petal.Width)) #se puede usar desc(Petal.Width) para el orden descendiente
> Sepal.Length Sepal.Width Petal.Length Petal_Width Species
> 1 5.1 3.5 1.4 0.2 setosa
> 2 4.9 3.0 1.4 0.2 setosa
> 3 4.7 3.2 1.3 0.2 setosa
> 4 4.6 3.1 1.5 0.2 setosa
> 5 5.0 3.6 1.4 0.2 setosa
> 6 5.4 3.9 1.7 0.4 setosa
Create or adds new variables
> function (.data, ...)
head(mutate(iris, Petal_std=Petal.Width-mean(Petal.Width))) #se puede usar desc(Petal.Width) para el orden descendiente
> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal_std
> 1 5.1 3.5 1.4 0.2 setosa -0.9993333
> 2 4.9 3.0 1.4 0.2 setosa -0.9993333
> 3 4.7 3.2 1.3 0.2 setosa -0.9993333
> 4 4.6 3.1 1.5 0.2 setosa -0.9993333
> 5 5.0 3.6 1.4 0.2 setosa -0.9993333
> 6 5.4 3.9 1.7 0.4 setosa -0.7993333
Group by categories. Es una especie de tabla dinamica.
> function (.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))
group<-group_by(iris, Species) #se puede usar desc(Petal.Width) para el orden descendiente
summarise(group, Petal_meanWidth = mean(Petal.Width), Max_Sepal_length = max(Sepal.Length))
> `summarise()` ungrouping output (override with `.groups` argument)
> # A tibble: 3 x 3
> Species Petal_meanWidth Max_Sepal_length
> <fct> <dbl> <dbl>
> 1 setosa 0.246 5.8
> 2 versicolor 1.33 7
> 3 virginica 2.03 7.9
ddply() Split Data Frame, Apply Function, And Return Results In A Data Frame.
library(plyr)
> Warning: package 'plyr' was built under R version 4.0.2
> ------------------------------------------------------------------------------
> You have loaded plyr after dplyr - this is likely to cause problems.
> If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
> library(plyr); library(dplyr)
> ------------------------------------------------------------------------------
>
> Attaching package: 'plyr'
> The following objects are masked from 'package:dplyr':
>
> arrange, count, desc, failwith, id, mutate, rename, summarise,
> summarize
ans <- ddply(iris, .(Species), function(x) {h<-max(x$Sepal.Length);
p<-mean(x$Petal.Width);ans<-cbind(h,p)})
names(ans)<-c("Species","Petal_meanWidth","Max_Sepal_length"); ans
> Species Petal_meanWidth Max_Sepal_length
> 1 setosa 5.8 0.246
> 2 versicolor 7.0 1.326
> 3 virginica 7.9 2.026
Merge with common id. merge, join, join_all
> function (.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))
group<-group_by(iris, Species) #se puede usar desc(Petal.Width) para el orden descendiente
summarise(group, Petal_meanWidth=mean(Petal.Width), Max_Sepal_length=max(Sepal.Length))
> Petal_meanWidth Max_Sepal_length
> 1 1.199333 7.9
Important Save the file as svg. (scalable vector). So it wont lose resolution.
par()plot() - hist()boxplot()In the base system, first a graph has to be initialized and then it has to be annotated.
par() creates global parameters of the all graphics. Can be overridden. Anotation like adding lines, points, legends, etc.pch the plotting symbol if its ’*" or ‘.’. 1 is circlelty line typelwd line widthcol plotting color, number, string or hexcodexlab and ylab labelsmar: A numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin to be specified on the four sides of the plot. The default is c(5, 4, 4, 2) + 0.1bg background colorlas orientation of the axismfrow ormcol to create a subplot: A vector of the form c(nr, nc). Subsequent figures will be drawn in an nr-by-nc array on the device by columns (mfcol), or rows (mfrow), respectively.oma A vector of the form c(bottom, left, top, right) giving the size of the outer margins in lines of text. default par("oma") is 0 0 0 0type =“n” makes the data invisiblelines: add lines to a plot, given a vector of x values and a corresponding vector of y values (or a 2-column matrix); this function just connects the dots
points: add points to a plot
text: add text labels to a plot using specified x, y coordinates
title: add annotations to x, y axis labels, title, subtitle, outer margin
mtext: add arbitrary text to the margins (inner or outer) of the plot
axis: adding axis ticks/labels
Can create scatterplots separated by category
with(data = iris, plot(Petal.Width, Petal.Length, col=Species))
title( main = " Petal Length vs Petal Width ")
with(data = airquality, plot(Wind, Ozone, main = " Ozone and Wind in NY", col="red")) #Title can be added to plot
with(subset(airquality, Month ==5), points(Wind, Ozone, col="blue")) # Added on top
legend("topright", pch=1, col=c("blue", "red"), legend=c("May", "Other Months"))
# Legend=topleft, toprigth, bottomleft, bottomrigth
par() can be used to set or query graphical parameterspar(mfcol=c(1,2),mar=c(4,4,2,1))
hist(subset(iris, Species == "setosa")$Sepal.Length, breaks = 10, main="Setosa sepal length", xlab="Length", col="#8BC5CC")
hist(subset(iris, Species == "virginica")$Sepal.Length, breaks = 10, main="Virginica sepal length", xlab="Length", col="#8BCCA2")
par(mfrow = c(2, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(airquality,
{
plot(Wind, Ozone, main = "Ozone and Wind")
plot(Solar.R, Ozone, main = "Ozone and Solar Radiation", pch=2, type="p", col="blue", col.lab="red")
plot(Temp, Ozone, main = "Ozone and Temperature", pch=5, type="l", col.main='#296A40')
plot(Wind, Ozone, main = "Ozone and Wind", pch=10, col.axis='purple')
plot(Solar.R, Ozone, main = "Ozone and Solar Radiation", pch=".", type="h", fg="green")
plot(Temp, Ozone, main = "Ozone and Temperature", pch="a", type="b")
mtext("Ozone and Weather in New York City", outer = TRUE) #outer text
})
hist Computes a histogram of the given data values
> function (x, ...)
breaks=num sets the number of cells in the histogramhist(iris$Sepal.Length, breaks = 10)
Can add rug() to see all the points.
hist(iris$Sepal.Length, breaks = 10)
rug(iris$Sepal.Length)
SmoothScatter() produces a smoothed color density representation of a scatterplot, obtained through a (2D) kernel density estimate. Like a 2D histogram.
x <- rnorm(10000)
y <- rnorm(10000)
par(mfrow=c(1,3))
plot(x,y, pch =19); smoothScatter(x,y); plot(x,y, col = rgb(0,0,0,0.05), pch =19)
boxplot Produce box-and-whisker plot(s) of the given (grouped) values.
airquality <- transform( airquality, Month =factor(Month))
boxplot(Ozone ~ Month, airquality, xlab = "Month", ylab="Ozone ppb")
lattice Everything in one line code. How to interpret next code line: life.exp ~ Income is the same as y vs x, the | represents the condition. i.e I want to see Life.Exp vs Income per month. xyplot
library(lattice)
state<-data.frame(state.x77, region = state.region)
xyplot(Life.Exp ~ Income| region, data= state, layout=c(4,1))
airquality <- transform(airquality, Month = factor(Month))
xyplot(Ozone ~ Wind | Month, data = airquality, layout=c(5,1))
It’s worth to mention that without the transformation there would be no labels corresponding to the months
ggplot2 Mixes elements of Base and Lattice. Uses qplot().
Color in ggplot
str(qplot)
> function (x, y, ..., data, facets = NULL, margins = FALSE, geom = "auto",
> xlim = c(NA, NA), ylim = c(NA, NA), log = "", main = NULL, xlab = NULL,
> ylab = NULL, asp = NA, stat = NULL, position = NULL)
str(mpg)
> tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
> $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
> $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
> $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
> $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
> $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
> $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
> $ drv : chr [1:234] "f" "f" "f" "f" ...
> $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
> $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
> $ fl : chr [1:234] "p" "p" "p" "p" ...
> $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
p1 <- qplot(displ, hwy, data=mpg)
p2 <- qplot(displ, hwy, data=mpg, color=year, geom = c("point","smooth"))
grid.arrange(p1, p2, nrow = 1)
> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
qplot using only one argument is an histogramp1 <- qplot(hwy, data=mpg, fill=drv, main ="Separated by drv")
p2 <- qplot(hwy, data=mpg, fill=manufacturer, main ="Separated by manufacturer")
grid.arrange(p1, p2, nrow = 2)
> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
facets determines the columns of the panels. i.e each panel corresponds to a variable in drv.p1 <- qplot(displ, hwy, data=mpg, facets = . ~ drv, main ="Without row definition")
p2 <- qplot(hwy, data=mpg, facets = drv ~ ., main ="Without col definition", binwidth =2)
grid.arrange(p1, p2, nrow = 1)
air <- datasets::airquality
data <-na.omit(air);
data$Temp <- as.numeric(data$Temp); data$Day <- as.numeric(data$Day);
str(data);
> 'data.frame': 111 obs. of 6 variables:
> $ Ozone : int 41 36 12 18 23 19 8 16 11 14 ...
> $ Solar.R: int 190 118 149 313 299 99 19 256 290 274 ...
> $ Wind : num 7.4 8 12.6 11.5 8.6 13.8 20.1 9.7 9.2 10.9 ...
> $ Temp : num 67 72 74 62 65 59 61 69 66 68 ...
> $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
> $ Day : num 1 2 3 4 7 8 9 12 13 14 ...
> - attr(*, "na.action")= 'omit' Named int [1:42] 5 6 10 11 25 26 27 32 33 34 ...
> ..- attr(*, "names")= chr [1:42] "5" "6" "10" "11" ...
g <- ggplot(data, aes(Day, Temp)) #aes defines the mapping. i.e the x and y
g + geom_point(aes(color = Month)) + facet_grid( . ~ Month)+ scale_color_distiller(palette="Blues")
g + geom_point(aes(color = Month), size = 2) + scale_color_distiller(palette="Spectral")
ggplot(data, aes(Day, Temp, group=Month, color=Month)) + geom_line() +geom_point() + scale_color_distiller(palette="Oranges") + theme(plot.title = element_text(hjust = 0.5)) +labs( x="Days", y="Temperature", title = "Temp vs Days")
NOTE To add limits do not add ylim instead coord_cartesian.
To see all the colors that can be called by name use colors().
Using colorRamp() a palette can be created between the given colors.
ramp <- colorRamp(c("red", "blue")) #
ramp(1) #Blue
> [,1] [,2] [,3]
> [1,] 0 0 255
ramp(0) #Red
> [,1] [,2] [,3]
> [1,] 255 0 0
ramp(seq(0,1, len = 10))
> [,1] [,2] [,3]
> [1,] 255.00000 0 0.00000
> [2,] 226.66667 0 28.33333
> [3,] 198.33333 0 56.66667
> [4,] 170.00000 0 85.00000
> [5,] 141.66667 0 113.33333
> [6,] 113.33333 0 141.66667
> [7,] 85.00000 0 170.00000
> [8,] 56.66667 0 198.33333
> [9,] 28.33333 0 226.66667
> [10,] 0.00000 0 255.00000
Using colorRampPalette() a hexadecimal system for colors can be created
pal <- colorRampPalette(c("red", "blue")) #
pal(1) #Blue
> [1] "#FF0000"
pal(10) #Red
> [1] "#FF0000" "#E2001C" "#C60038" "#AA0055" "#8D0071" "#71008D" "#5500AA"
> [8] "#3800C6" "#1C00E2" "#0000FF"
par(mfrow = c(2, 1))
hist(1:10, col=pal(10)); hist(1:10, col=pal(5))
hclust()kmeans()hclust allows to create a hierarchical clustering. The idea is that the points with smallest distant among each other are merged together.
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each =4), sd=0.2)
y <- rnorm(12, mean = rep(c(1,2,1), each =4), sd=0.2)
plot(x,y, col ='blue', pch =19); text(x+0.05,y+0.05, labels = as.character(1:12))
dataFrame <- data.frame(x=x,y=y); dataFrame
> x y
> 1 0.7585869 0.8447492
> 2 1.0554858 1.0128918
> 3 1.2168882 1.1918988
> 4 0.5308605 0.9779429
> 5 2.0858249 1.8977981
> 6 2.1012112 1.8177609
> 7 1.8850520 1.8325657
> 8 1.8906736 2.4831670
> 9 2.8871096 1.0268176
> 10 2.8219924 0.9018628
> 11 2.9045615 0.9118904
> 12 2.8003227 1.0919179
distxy <-dist(dataFrame); distxy # Pairwise distance, default Euclidean distance
> 1 2 3 4 5 6 7
> 2 0.34120511
> 3 0.57493739 0.24102750
> 4 0.26381786 0.52578819 0.71861759
> 5 1.69424700 1.35818182 1.11952883 1.80666768
> 6 1.65812902 1.31960442 1.08338841 1.78081321 0.08150268
> 7 1.49823399 1.16620981 0.92568723 1.60131659 0.21110433 0.21666557
> 8 1.99149025 1.69093111 1.45648906 2.02849490 0.61704200 0.69791931 0.65062566
> 9 2.13629539 1.83167669 1.67835968 2.35675598 1.18349654 1.11500116 1.28582631
> 10 2.06419586 1.76999236 1.63109790 2.29239480 1.23847877 1.16550201 1.32063059
> 11 2.14702468 1.85183204 1.71074417 2.37461984 1.28153948 1.21077373 1.37369662
> 12 2.05664233 1.74662555 1.58658782 2.27232243 1.07700974 1.00777231 1.17740375
> 8 9 10 11
> 2
> 3
> 4
> 5
> 6
> 7
> 8
> 9 1.76460709
> 10 1.83517785 0.14090406
> 11 1.86999431 0.11624471 0.08317570
> 12 1.66223814 0.10848966 0.19128645 0.20802789
hClustering <- hclust(distxy); plot(hClustering)
To create pretty dendograms:
myplclust(hClustering, lab = hClustering$order, lab.col = rep(1:3, each =4))
kmeans Perform k-means clustering on a data matrix. kmeans other source
kmeansobj <- kmeans(dataFrame, centers = 3, iter.max = 10)
kmeansobj$cluster #for each data, which cluster is in.
> [1] 3 1 1 3 2 2 2 2 2 2 2 2
plot(x,y, col= kmeansobj$cluster, pch = 19); points(kmeansobj$centers, col=1:3, pch=3)
rnorm()dnorm()pnorm()rnorm()sample()There are different functions for probability distribution in R:
Hay diferentes functiones o letras asociadas: p: distribucion cumulativa, q: funcion cuantil, r: generador de numeros aleatorios, d: densidad. En orden: rnorm, dnorm
> function (n, mean = 0, sd = 1)
> function (x, mean = 0, sd = 1, log = FALSE)
> function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
> function (n, lambda)
Para reajustar o reestablecer el generador de numeros se utiliza set.seed(1).
The function sample draws ramdomly from a specified set of sclara objects allowing to sample from arbitry distributions.
str(sample) # con repeat se puede obtener numeros repetidos
> function (x, size, replace = FALSE, prob = NULL)
set.seed(1)
sample(1:10,4)
> [1] 9 4 7 1
telegram <- function(...){paste("START", ..., "STOP")}
lm()melt()lm()lm()In the 1880’s, Francis Galton was developing ways to quantify the heritability of traits. As part of this work, he collected data on the heights of adult children and their parents. We use melt to reshape the table
library(reshape);
data(galton);
str(galton)
> 'data.frame': 928 obs. of 2 variables:
> $ child : num 61.7 61.7 61.7 61.7 61.7 62.2 62.2 62.2 62.2 62.2 ...
> $ parent: num 70.5 68.5 65.5 64.5 64 67.5 67.5 67.5 66.5 66.5 ...
head(galton)
> child parent
> 1 61.7 70.5
> 2 61.7 68.5
> 3 61.7 65.5
> 4 61.7 64.5
> 5 61.7 64.0
> 6 62.2 67.5
long <- melt(galton)
> Using as id variables
head(long)
> variable value
> 1 child 61.7
> 2 child 61.7
> 3 child 61.7
> 4 child 61.7
> 5 child 61.7
> 6 child 62.2
g <- ggplot(long, aes(x = value, fill = variable))
g <- g + geom_histogram(colour = "black", binwidth=1)
g <- g + facet_grid(. ~ variable) + labs("Histogram of childs and parents heigths")
g
The command for simple linear regression is lm(response variable~explanatory variable). lm() is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance. Its the same calculating the betas or using lm().
When fitting a linear model y ~ x - 1 specifies a line through the origin. This is useful in some cases
y <- galton$child
x <- galton$parent
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
rbind(c(beta0, beta1), coef(lm(y ~ x))) #Same using both procedures
> (Intercept) x
> [1,] 23.94153 0.6462906
> [2,] 23.94153 0.6462906
Si se normalizan los datos, entonces, la pendiente es la correlacion. La correlacion de x y y es la misma que de los valores normalizados y de la ecuacion de la recta que pasa por los ultimos. Pero el coeficiente no es el mismo que cuando pasa por datos no normalizados.
yn <- (y - mean(y))/sd(y)
xn <- (x - mean(x))/sd(x)
c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2])
> xn
> 0.4587624 0.4587624 0.4587624
Add regression line to plot
Consider the data set given below. Give the value of mu that minimazes the least squares equation
x <- c(0.18, -1.54, 0.42, 0.95)
w <- c(2, 1, 3, 1)
mean(w*x)/mean(w)
> [1] 0.1471429
Consider the data set. Fit the regression through the origin and get the slope treating y as the outcome and x as the regressor. (Hint, do not center the data since we want regression through the origin, not through the means of the data.)
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)
lm(formula = y ~ x - 1)
>
> Call:
> lm(formula = y ~ x - 1)
>
> Coefficients:
> x
> 0.8263
summary(lm(y~x-1))
>
> Call:
> lm(formula = y ~ x - 1)
>
> Residuals:
> Min 1Q Median 3Q Max
> -2.0692 -0.2536 0.5303 0.8592 1.1286
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> x 0.8263 0.5817 1.421 0.189
>
> Residual standard error: 1.094 on 9 degrees of freedom
> Multiple R-squared: 0.1831, Adjusted R-squared: 0.09238
> F-statistic: 2.018 on 1 and 9 DF, p-value: 0.1892
Do data(mtcars) from the datasets package and fit the regression model with mpg as the outcome and weight as the predictor. Give the slope coefficient.
data(mtcars)
head(mtcars)
> mpg cyl disp hp drat wt qsec vs am gear carb
> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
x <- mtcars$wt
y <- mtcars$mpg
lm(y ~ x)
>
> Call:
> lm(formula = y ~ x)
>
> Coefficients:
> (Intercept) x
> 37.285 -5.344
summary(lm(mpg ~ wt, mtcars)) #anothes form to write this
>
> Call:
> lm(formula = mpg ~ wt, data = mtcars)
>
> Residuals:
> Min 1Q Median 3Q Max
> -4.5432 -2.3647 -0.1252 1.4096 6.8727
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
> wt -5.3445 0.5591 -9.559 1.29e-10 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 3.046 on 30 degrees of freedom
> Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
> F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
What is the value of the first measurement if x were normalized (to have mean 0 and variance 1)?
x <- c(8.58, 10.46, 9.01, 9.64, 8.86)
xn <- (x-mean(x))/sd(x)
xn
> [1] -0.9718658 1.5310215 -0.3993969 0.4393366 -0.5990954
What value minimizes the sum of the squared distances between these points and itself? For least squares the mean minimizes.
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
mean(x)
> [1] 0.573
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)
lm(y~x)
>
> Call:
> lm(formula = y ~ x)
>
> Coefficients:
> (Intercept) x
> 1.567 -1.713