library(ISLR)
Inicialmente veremos el contenido del data set weekly Problema A)
summary(Weekly)
Year Lag1 Lag2 Lag3
Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
Lag4 Lag5 Volume Today
Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
Direction
Down:484
Up :605
Como podemos observar es muy similar al escenario del data set de SMARKET Al observar el plot graph vemos una pobre relación entre las variable libres lags sin emargo el año y el volumen tienen muestra una relación entre estas variables
cor(Weekly[,-9])
Year Lag1 Lag2 Lag3 Lag4
Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
Lag5 Volume Today
Year -0.030519101 0.84194162 -0.032459894
Lag1 -0.008183096 -0.06495131 -0.075031842
Lag2 -0.072499482 -0.08551314 0.059166717
Lag3 0.060657175 -0.06928771 -0.071243639
Lag4 -0.075675027 -0.06107462 -0.007825873
Lag5 1.000000000 -0.05851741 0.011012698
Volume -0.058517414 1.00000000 -0.033077783
Today 0.011012698 -0.03307778 1.000000000
pairs(Weekly)
Not all of the characters in ~/fiabilidad islr lab47.rmd could be encoded using ISO8859-1. To save using a different encoding, choose "File | Save with Encoding..." from the main menu.
Problema B)
visualizaremos el volumen, y observaremos el corrimiento de la densidad sobre la curba
plot(Weekly$Volume)
Calcularemos todas las variables para determinar la estadistica de la regrecion logaritmica
En la siguiente tabla vermos que Lag2 es significativo
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume ,data=Weekly ,family=binomial)
summary (glm.fit)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = binomial, data = Weekly)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6949 -1.2565 0.9913 1.0849 1.4579
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.26686 0.08593 3.106 0.0019 **
Lag1 -0.04127 0.02641 -1.563 0.1181
Lag2 0.05844 0.02686 2.175 0.0296 *
Lag3 -0.01606 0.02666 -0.602 0.5469
Lag4 -0.02779 0.02646 -1.050 0.2937
Lag5 -0.01447 0.02638 -0.549 0.5833
Volume -0.02274 0.03690 -0.616 0.5377
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1496.2 on 1088 degrees of freedom
Residual deviance: 1486.4 on 1082 degrees of freedom
AIC: 1500.4
Number of Fisher Scoring iterations: 4
Problema C) En base a esto visualizaremos el contraste de direcion y visualizaremos nuestra matriz de confusion
contrasts(Weekly$Direction )
Up
Down 0
Up 1
glm.probs=predict(glm.fit,type="response")
glm.pred=rep("Down",length(glm.probs))
glm.pred[glm.probs>0.5]="Up"
table(glm.pred,Weekly$Direction)
glm.pred Down Up
Down 54 48
Up 430 557
mean(glm.pred == Weekly$Direction)
[1] 0.5610652
Problema D)
Generaremos el problema como vimos usando lag2 como variable libre y notaremos que tiene una significancia para ser tomada encuenta en el summary
train <- Weekly[,"Year"] <= 2008
glm.fit <- glm(Direction~Lag2,data = Weekly,subset = train, family = "binomial")
summary(glm.fit)
Call:
glm(formula = Direction ~ Lag2, family = "binomial", data = Weekly,
subset = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.536 -1.264 1.021 1.091 1.368
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.20326 0.06428 3.162 0.00157 **
Lag2 0.05810 0.02870 2.024 0.04298 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1354.7 on 984 degrees of freedom
Residual deviance: 1350.5 on 983 degrees of freedom
AIC: 1354.5
Number of Fisher Scoring iterations: 4
A este metodo veremos la matriz de confusion y el mean del test
glm.probs <- predict(glm.fit,Weekly[!train,],type = "response")
glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.5] = "Up"
table(glm.pred,Weekly[,"Direction"])
glm.pred Down Up
Down 63 85
Up 421 520
mean(glm.pred == Weekly[,"Direction"])
[1] 0.5353535
Problema E) Para el caso del problema LDA, vemos que el factor es mucho mejor en su proceso de predicción
library(MASS)
lda.fit <- lda(Direction~Lag2,data=Weekly,subset=train)
lda.pred <- predict(lda.fit,Weekly[!train,])
lda.class <- lda.pred$class
table(lda.class,Weekly[!train,9])
lda.class Down Up
Down 9 5
Up 34 56
mean(lda.class == Weekly[!train,9])
[1] 0.625
Problema F) Para el problema cuadratico tenemos:
qda.fit <- qda(Direction~Lag2,data=Weekly,subset=train)
qda.pred <- predict(qda.fit,Weekly[!train,])
qda.class <- qda.pred$class
table(qda.class,Weekly[!train,9])
qda.class Down Up
Down 0 0
Up 43 61
mean(qda.class == Weekly[!train,9])
[1] 0.5865385
Problema G) Por ultimo el caso KNN, es pesimo el algoritmo en k=1
library(class)
train.X <- cbind(Weekly[train,3])
test.X <- cbind(Weekly[!train,3])
train.Direction <- Weekly[train,c(9)]
test.Direction <- Weekly[!train,c(9)]
knn.pred <- knn(train.X,test.X,train.Direction,k=1)
table(knn.pred,test.Direction)
test.Direction
knn.pred Down Up
Down 21 30
Up 22 31
mean(knn.pred == test.Direction)
[1] 0.5
Problema H)
Por el momento el mejor escenario es LDA, iteraremos K=4 para KNN en el problema Final
knn.pred <- knn(train.X,test.X,train.Direction,k=4)
table(knn.pred,test.Direction)
test.Direction
knn.pred Down Up
Down 20 19
Up 23 42
mean(knn.pred == test.Direction)
[1] 0.5961538
KNN mejora hasta k=4 pero no logra superar a LDA
En este problema estudiaremos el comportamiento del rendimiento de automobiles en el dataset Auto
Problema a) Crear arreglo binario basados en la media si el valor es mayor a la media para mpg
mpg01 <- rep(0,nrow(Auto))
mpg01[Auto[,'mpg']>median(Auto[,'mpg'])] <- 1
mpg01 = as.factor(mpg01)
Data = data.frame(Auto,mpg01)
table(mpg01)
mpg01
0 1
196 196
Problema B)
pairs(Data)
library(ggplot2)
ggplot(Data, aes( Data$mpg01,Data$acceleration)) + geom_violin()
ggplot(Data, aes( Data$mpg01,Data$weight)) + geom_violin()
ggplot(Data, aes( Data$mpg01,Data$horsepower)) + geom_violin()
ggplot(Data, aes( Data$mpg01,Data$displacement)) + geom_violin()
Problema C) crear el train y test basado en porcentajes
train_sample <- sample(1:nrow(Data),size = nrow(Data)*0.7 )
train_data <- Data[train_sample,]
test_data <- setdiff(1:nrow(Data), train_data)
Data.train = train_data
Data.test = test_data
lda.fit = lda(mpg01~weight+displacement,Data)
lda.fit
Call:
lda(mpg01 ~ weight + displacement, data = Data)
Prior probabilities of groups:
0 1
0.5 0.5
Group means:
weight displacement
0 3620.403 273.1582
1 2334.765 115.6658
Coefficients of linear discriminants:
LD1
weight -0.001011194
displacement -0.006968032
Problema D) Prediciendo basado en las variables libres
Data.test = Data[196:392,]
lda.pred = predict(lda.fit, Data.test[,c('horsepower','weight','displacement')] )
table(lda.pred$class,Data.test[,'mpg01'])
0 1
0 56 5
1 11 125
mean(lda.pred$class != Data.test[,'mpg01'])
[1] 0.08121827
Problema E) generando el modelo cuadratico es un modelo pesimo pues solo determina el 10%
qda.fit = qda(mpg01~horsepower+displacement,Data)
qda.pred = predict(qda.fit, Data.test[,c('horsepower','weight','displacement')] )
table(qda.pred$class,Data.test[,'mpg01'])
0 1
0 57 11
1 10 119
mean(qda.pred$class != Data.test[,'mpg01'])
[1] 0.106599
Problema F) Para el caso de del metodo de R logistica
logit.fit = glm(mpg01~horsepower+displacement,family = binomial,Data.train)
logit.fit
Call: glm(formula = mpg01 ~ horsepower + displacement, family = binomial,
data = Data.train)
Coefficients:
(Intercept) horsepower displacement
9.19005 -0.04605 -0.03501
Degrees of Freedom: 195 Total (i.e. Null); 193 Residual
Null Deviance: 251.8
Residual Deviance: 84.57 AIC: 90.57
summary(logit.fit)
Call:
glm(formula = mpg01 ~ horsepower + displacement, family = binomial,
data = Data.train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.30936 -0.17525 -0.00666 0.34107 2.11244
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.19005 2.03526 4.515 6.32e-06 ***
horsepower -0.04605 0.02242 -2.054 0.04 *
displacement -0.03501 0.00745 -4.700 2.60e-06 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 251.761 on 195 degrees of freedom
Residual deviance: 84.575 on 193 degrees of freedom
AIC: 90.575
Number of Fisher Scoring iterations: 8
Determinamos la matriz de confución con un thresshold del 50%
mean(logit.class != Data.test[,'mpg01'])
[1] 0.1878173
Para el caso del 25% en el proceso de clasificación
mean(logit.class != Data.test[,'mpg01'])
[1] 0.1675127
Probelma A,B,C,D) Construir una funcion de potencias el resultado es 8=2 elevado a la 3 Todos los probelmas conceptualmente son lo mismo
potencia(10,3)
[1] 1000
Probelma E)
Potencias por arreglo
potencia_array = function(x,a){
y = sapply(x,function(x){x^a})
return(y)
}
Warning messages:
1: In knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet, :
The encoding ("ISO8859-1") is not UTF-8. We will only support UTF-8 in the future. Please re-save your file "fiabilidad_islr_lab47.rmd" with the UTF-8 encoding. See https://yihui.name/en/2018/11/biggest-regret-knitr/ for more info.
2: In knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet, :
The encoding ("ISO8859-1") is not UTF-8. We will only support UTF-8 in the future. Please re-save your file "fiabilidad_islr_lab47.rmd" with the UTF-8 encoding. See https://yihui.name/en/2018/11/biggest-regret-knitr/ for more info.
potencia_array(1:10,3)
[1] 1 8 27 64 125 216 343 512 729 1000
plot(potencia_array(1:10,3))