Acontinuacion se desarrollan los ejercicios del libro “Introduction to Statistical Learning” de Gareth James, Daniel Witten, Trevor Hastie y Robert Tibshirani.
This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
Attaching package: ‘dplyr’
The following object is masked from ‘package:MASS’:
select
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Loading required package: lattice
Loading required package: ggplot2
RStudio Community is a great place to get help: https://community.rstudio.com/c/tidyverse.
Weekly
summary(Weekly)
Year Lag1 Lag2 Lag3 Lag4
Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580 1st Qu.: -1.1580
Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410 Median : 0.2380
Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472 Mean : 0.1458
3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090 3rd Qu.: 1.4090
Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
Lag5 Volume Today Direction
Min. :-18.1950 Min. :0.08747 Min. :-18.1950 Down:484
1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540 Up :605
Median : 0.2340 Median :1.00268 Median : 0.2410
Mean : 0.1399 Mean :1.57462 Mean : 0.1499
3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
Max. : 12.0260 Max. :9.32821 Max. : 12.0260
pairs(Weekly)
cor(Weekly[,-9])
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923 -0.030519101 0.84194162 -0.032459894
Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876 -0.008183096 -0.06495131 -0.075031842
Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535 -0.072499482 -0.08551314 0.059166717
Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865 0.060657175 -0.06928771 -0.071243639
Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000 -0.075675027 -0.06107462 -0.007825873
Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027 1.000000000 -0.05851741 0.011012698
Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617 -0.058517414 1.00000000 -0.033077783
Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873 0.011012698 -0.03307778 1.000000000
R/ Year y Volume son las unicas variables que parecen tener alguna correlacion, dado que son las unicas que presentan una relacion positiva.
attach(Weekly)
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly,
family = binomial)
summary(glm.fit)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = binomial, data = Weekly)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6949 -1.2565 0.9913 1.0849 1.4579
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.26686 0.08593 3.106 0.0019 **
Lag1 -0.04127 0.02641 -1.563 0.1181
Lag2 0.05844 0.02686 2.175 0.0296 *
Lag3 -0.01606 0.02666 -0.602 0.5469
Lag4 -0.02779 0.02646 -1.050 0.2937
Lag5 -0.01447 0.02638 -0.549 0.5833
Volume -0.02274 0.03690 -0.616 0.5377
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1496.2 on 1088 degrees of freedom
Residual deviance: 1486.4 on 1082 degrees of freedom
AIC: 1500.4
Number of Fisher Scoring iterations: 4
R/ Lag 2 parece tener significancia al presentar un \(P_r(>|z|)=0.0296\)
glm.probs = predict(glm.fit, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction)
Direction
glm.pred Down Up
Down 54 48
Up 430 557
R/ La prediccion actual es de 56%. Segun la regresion logistica la prediccion de las semanas cuenta con 92% con un error del 11%.
train = (Year < 2009)
Weekly.0910 = Weekly[!train, ]
glm.fit = glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
glm.probs = predict(glm.fit, Weekly.0910, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
Direction.0910 = Direction[!train]
table(glm.pred, Direction.0910)
Direction.0910
glm.pred Down Up
Down 9 5
Up 34 56
mean(glm.pred == Direction.0910)
[1] 0.625
lda.fit = lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.pred = predict(lda.fit, Weekly.0910)
table(lda.pred$class, Direction.0910)
Direction.0910
Down Up
Down 9 5
Up 34 56
mean(lda.pred$class == Direction.0910)
[1] 0.625
qda.fit = qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.class = predict(qda.fit, Weekly.0910)$class
table(qda.class, Direction.0910)
Direction.0910
qda.class Down Up
Down 0 0
Up 43 61
mean(qda.class == Direction.0910)
[1] 0.5865385
train.X = as.matrix(Lag2[train])
test.X = as.matrix(Lag2[!train])
train.Direction = Direction[train]
set.seed(1)
knn.pred = knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.0910)
Direction.0910
knn.pred Down Up
Down 21 30
Up 22 31
mean(knn.pred == Direction.0910)
[1] 0.5
R/ Los métodos de regresión logística y LDA proporcionan tasas de error de prueba similares.
glm.fit = glm(Direction ~ Lag2:Lag1, data = Weekly, family = binomial, subset = train)
glm.probs = predict(glm.fit, Weekly.0910, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
Direction.0910 = Direction[!train]
table(glm.pred, Direction.0910)
Direction.0910
glm.pred Down Up
Down 1 1
Up 42 60
mean(glm.pred == Direction.0910)
[1] 0.5865385
lda.fit = lda(Direction ~ Lag2:Lag1, data = Weekly, subset = train)
lda.pred = predict(lda.fit, Weekly.0910)
mean(lda.pred$class == Direction.0910)
[1] 0.5769231
qda.fit = qda(Direction ~ Lag2 + sqrt(abs(Lag2)), data = Weekly, subset = train)
qda.class = predict(qda.fit, Weekly.0910)$class
table(qda.class, Direction.0910)
Direction.0910
qda.class Down Up
Down 12 13
Up 31 48
mean(qda.class == Direction.0910)
[1] 0.5769231
knn.pred = knn(train.X, test.X, train.Direction, k = 10)
table(knn.pred, Direction.0910)
Direction.0910
knn.pred Down Up
Down 17 18
Up 26 43
mean(knn.pred == Direction.0910)
[1] 0.5769231
knn.pred = knn(train.X, test.X, train.Direction, k = 100)
table(knn.pred, Direction.0910)
Direction.0910
knn.pred Down Up
Down 9 12
Up 34 49
mean(knn.pred == Direction.0910)
[1] 0.5576923
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
summary(Auto)
mpg cylinders displacement horsepower weight acceleration
Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613 Min. : 8.00
1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225 1st Qu.:13.78
Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804 Median :15.50
Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978 Mean :15.54
3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615 3rd Qu.:17.02
Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140 Max. :24.80
year origin name
Min. :70.00 Min. :1.000 amc matador : 5
1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
Median :76.00 Median :1.000 toyota corolla : 5
Mean :75.98 Mean :1.577 amc gremlin : 4
3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
Max. :82.00 Max. :3.000 chevrolet chevette: 4
(Other) :365
nauto <- Auto %>% mutate(mpg01 = ifelse(mpg > median(mpg), 1, 0))
nauto
cor(nauto[, -9])
mpg cylinders displacement horsepower weight acceleration year origin
mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442 0.4233285 0.5805410 0.5652088
cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273 -0.5046834 -0.3456474 -0.5689316
displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944 -0.5438005 -0.3698552 -0.6145351
horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377 -0.6891955 -0.4163615 -0.4551715
weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000 -0.4168392 -0.3091199 -0.5850054
acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392 1.0000000 0.2903161 0.2127458
year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199 0.2903161 1.0000000 0.1815277
origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054 0.2127458 0.1815277 1.0000000
mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566 0.3468215 0.4299042 0.5136984
mpg01
mpg 0.8369392
cylinders -0.7591939
displacement -0.7534766
horsepower -0.6670526
weight -0.7577566
acceleration 0.3468215
year 0.4299042
origin 0.5136984
mpg01 1.0000000
pairs(nauto)
Como vemos podriamos usar acceleration, year, origin, y por su puesto mpg.
index <- createDataPartition(Auto$year, p=0.7, list=F)
train <- nauto[index,]
test <- nauto[-index,]
length(nauto$year)
[1] 392
length(train$year)
[1] 276
length(test$year)
[1] 116
lda_fit <- lda(mpg01~acceleration+year+origin+mpg, data = nauto, subset = index)
lda_pred <- predict(lda_fit, test)
mean(lda_pred$class != test$mpg01)
[1] 0.01724138
Como vemos logramos una prediccion de 2.5%, que pasa si probamos las otras variables
lda_fit <- lda(mpg01~cylinders + weight + displacement + horsepower, data = nauto, subset = index)
lda_pred <- predict(lda_fit, test)
mean(lda_pred$class != test$mpg01)
[1] 0.1034483
Con este nuevo modelo obetenemos una prediccion de 10%.
qda_fit <- qda(mpg01~acceleration+year+origin+mpg, data = nauto, subset = index)
lda_pred <- predict(qda_fit, test)
mean(lda_pred$class != test$mpg01)
[1] 0.04310345
qda_fit <- qda(mpg01~cylinders + weight + displacement + horsepower, data = nauto, subset = index)
lda_pred <- predict(qda_fit, test)
mean(lda_pred$class != test$mpg01)
[1] 0.09482759
glm_fit <- glm(mpg01~acceleration+year+origin+mpg, data = nauto, family=binomial, subset = index)
glm.fit: algorithm did not convergeglm.fit: fitted probabilities numerically 0 or 1 occurred
glm_probs <- predict(glm_fit, test, type="response")
glm_pred <- rep(0, length(glm_probs))
glm_pred[glm_probs>0.5] = 1
mean(glm_pred != test$mpg01)
[1] 0
glm_fit <- glm(mpg01~cylinders + weight + displacement + horsepower, data = nauto, family=binomial, subset = index)
glm_probs <- predict(glm_fit, test, type="response")
glm_pred <- rep(0, length(glm_probs))
glm_pred[glm_probs>0.5] = 1
mean(glm_pred != test$mpg01)
[1] 0.1206897
train_x = train %>% select(cylinders, weight, displacement, horsepower)
test_x = test %>% select(cylinders, weight, displacement, horsepower)
\(k=1\)
knn_pred = knn(train_x, test_x, train$mpg01, k = 1)
mean(knn_pred != test$mpg01)
[1] 0.1206897
\(k=20\)
knn_pred = knn(train_x, test_x, train$mpg01, k = 20)
mean(knn_pred != test$mpg01)
[1] 0.1293103
\(k=500\)
knn_pred = knn(train_x, test_x, train$mpg01, k = 135)
mean(knn_pred != test$mpg01)
[1] 0.1293103
power <- function(x){
return(2^x)
}
power(3)
[1] 8
power2 <- function(x, a){
return(x^a)
}
power2(3, 8)
[1] 6561
power2(10,3)
[1] 1000
power2(8,17)
[1] 2.2518e+15
power2(131,3)
[1] 2248091
power3 <- power2
x = 1:10
plot(x, power3(x, 2), log = "xy")
plot_power <- function(x, a){
plot(x, power3(x,a), log="xy")
}
plot_power(1:10, 3)
Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
summary(Boston)
crim zn indus chas nox rm
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000 Min. :0.3850 Min. :3.561
1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000 1st Qu.:0.4490 1st Qu.:5.886
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000 Median :0.5380 Median :6.208
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917 Mean :0.5547 Mean :6.285
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000 3rd Qu.:0.6240 3rd Qu.:6.623
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000 Max. :0.8710 Max. :8.780
age dis rad tax ptratio black
Min. : 2.90 Min. : 1.130 Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 77.50 Median : 3.207 Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 68.57 Mean : 3.795 Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :100.00 Max. :12.127 Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
nboston <- Boston %>% mutate(crim_avg = ifelse(crim >= median(crim), 1, 0))
nboston
Primero crearemos la particion 70-30
boston_index <- createDataPartition(nboston$crim, p=0.7, list=F)
boston_train <- nboston[boston_index,]
boston_test <- nboston[-boston_index,]
Ahora probemos la regresion geometrica
boston_glm <- glm(crim_avg ~ . - crim_avg - crim,data = nboston, family = binomial, subset=boston_index)
boston_glm_probs <- predict(boston_glm, boston_test, type="response")
boston_glm_pred <- rep(0, length(boston_glm_probs))
boston_glm_pred[boston_glm_probs > 0.5] <- 1
mean(boston_glm_pred != boston_test$crim_avg)
[1] 0.08
Utilizando regresion geometrica son 12.6% de error para un 87.4% de exito.
Ahora usaremos el algoritmo LDA
boston_lda = lda(crim_avg ~ . - crim_avg - crim, data = nboston, subset = boston_index)
boston_lda_pred = predict(boston_lda, boston_test)
mean(boston_lda_pred$class != boston_test$crim_avg)
[1] 0.1666667
LDA nos da un error de 16.67%
Ahora probemos knn con \(k=1\)
boston_knn_pred <- knn(boston_train[,-15], boston_test[,-15], boston_train$crim_avg, k=1)
mean(boston_knn_pred != boston_test$crim_avg)
[1] 0.04666667
Son 8% de error
\(k=10\)
boston_knn_pred <- knn(boston_train[,-15], boston_test[,-15], boston_train$crim_avg, k=10)
mean(boston_knn_pred != boston_test$crim_avg)
[1] 0.1
\(k=50\)
boston_knn_pred <- knn(boston_train[,-15], boston_test[,-15], boston_train$crim_avg, k=50)
mean(boston_knn_pred != boston_test$crim_avg)
[1] 0.14
\(k=100\)
boston_knn_pred <- knn(boston_train[,-15], boston_test[,-15], boston_train$crim_avg, k=100)
mean(boston_knn_pred != boston_test$crim_avg)
[1] 0.2133333