library(ISLR)
str(Weekly)
'data.frame': 1089 obs. of 9 variables:
$ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
$ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
$ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
$ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
$ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
$ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
$ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
$ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
$ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
summary(Weekly)
Year Lag1 Lag2 Lag3 Lag4
Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580 1st Qu.: -1.1580
Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410 Median : 0.2380
Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472 Mean : 0.1458
3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090 3rd Qu.: 1.4090
Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
Lag5 Volume Today Direction
Min. :-18.1950 Min. :0.08747 Min. :-18.1950 Down:484
1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540 Up :605
Median : 0.2340 Median :1.00268 Median : 0.2410
Mean : 0.1399 Mean :1.57462 Mean : 0.1499
3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
Max. : 12.0260 Max. :9.32821 Max. : 12.0260
plot(Weekly)
Volume increases with Year; no other pattern in the scatterplot matrix is clear enough to be observed.
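To make the Volume trend explicit, here is a minimal sketch (reusing the Weekly data loaded above) that plots the average weekly Volume for each year:
# average weekly trading volume per year, plotted in year order
plot(tapply(Weekly$Volume, Weekly$Year, mean), type = 'b',
     xlab = 'Year index (1 = 1990)', ylab = 'Mean weekly Volume')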
lrf <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = 'binomial')
summary(lrf)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = "binomial", data = Weekly)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6949 -1.2565 0.9913 1.0849 1.4579
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.26686 0.08593 3.106 0.0019 **
Lag1 -0.04127 0.02641 -1.563 0.1181
Lag2 0.05844 0.02686 2.175 0.0296 *
Lag3 -0.01606 0.02666 -0.602 0.5469
Lag4 -0.02779 0.02646 -1.050 0.2937
Lag5 -0.01447 0.02638 -0.549 0.5833
Volume -0.02274 0.03690 -0.616 0.5377
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1496.2 on 1088 degrees of freedom
Residual deviance: 1486.4 on 1082 degrees of freedom
AIC: 1500.4
Number of Fisher Scoring iterations: 4
Only Lag2 is statistically significant, with a p-value of about 0.03.
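The same conclusion can be read off programmatically; a short sketch that pulls the p-values out of the fitted model object lrf:
# the fourth column of the coefficient table holds Pr(>|z|)
summary(lrf)$coefficients[, 4]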
lrp <- predict(lrf, type = 'response')
lrpred <- rep('Down', length(lrp))
lrpred[lrp > 0.5] <- 'Up'
table(lrpred, Weekly$Direction)
lrpred Down Up
Down 54 48
Up 430 557
The overall error rate is \((48 + 430) \div (54 + 48 + 430 + 557) = 0.439\).
Logistic regression predicts correctly most of the time when the true direction is “Up” (\(557 \div (557 + 48) = 0.921\)), but is correct only \(54 \div (54 + 430) = 0.112\) of the time when it is “Down”.
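These rates can also be computed directly from the confusion matrix instead of by hand; a sketch assuming the lrpred vector built above:
cm <- table(lrpred, Weekly$Direction)
# overall error rate: off-diagonal cells divided by the total
1 - sum(diag(cm)) / sum(cm)
# per-class accuracy: fraction of each true class predicted correctly
diag(cm) / colSums(cm)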
isTest <- Weekly$Year > 2008
training <- Weekly[!isTest, ]
testing <- Weekly[isTest, ]
f10d <- glm(Direction ~ Lag2, data = training, family = 'binomial')
pred10d <- predict(f10d, testing, type = 'response')
res10d <- rep('Down', length(pred10d))
res10d[pred10d > 0.5] = 'Up'
table(res10d, testing$Direction)
res10d Down Up
Down 9 5
Up 34 56
mean(res10d == testing$Direction)
[1] 0.625
library(MASS)
f10e <- lda(Direction ~ Lag2, data = training)
pred10e <- predict(f10e, testing)
table(pred10e$class, testing$Direction)
Down Up
Down 9 5
Up 34 56
mean(pred10e$class == testing$Direction)
[1] 0.625
f10f <- qda(Direction ~ Lag2, data = training)
pred10f <- predict(f10f, testing)
table(pred10f$class, testing$Direction)
Down Up
Down 0 0
Up 43 61
mean(pred10f$class == testing$Direction)
[1] 0.5865385
library(class)
set.seed(1)
f10g <- knn(as.matrix(training$Lag2), as.matrix(testing$Lag2), training$Direction, k = 1)
table(f10g, testing$Direction)
f10g Down Up
Down 21 30
Up 22 31
mean(f10g == testing$Direction)
[1] 0.5
Note that the first two arguments of knn() must be a matrix or data frame, which is why as.matrix() is applied to the single Lag2 column.
A check on Exercise 8 of this chapter: is the training accuracy of KNN with \(K = 1\) really 100%?
library(tidyverse)
set.seed(1)
training <- Weekly[!isTest, ]
mk1 <- knn(as.matrix(training$Lag2), as.matrix(training$Lag2), training$Direction, k = 1)
table(mk1, training$Direction)
mk1 Down Up
Down 422 22
Up 19 522
mean(mk1 == training$Direction)
[1] 0.9583756
Surprisingly, the accuracy is not 100%.
Why do 41 points get a predicted Direction different from their own? Let's find them first:
outliers <- which(mk1 != training$Direction)
length(outliers)
[1] 41
outliers[1:2]
[1] 12 108
There are 41 such points, consistent with the \(22 + 19\) off-diagonal counts in the table() output above. Next, let's see how these points differ from the normal ones.
Take the first two of them (indices 12 and 108) as the objects of analysis, take the observation immediately after each one as a comparison, and look at each observation's Lag2 neighbors within a radius of 0.01:
t1 <- 12
t2 <- 108
eps <- 0.01
filter(training, Lag2 > training$Lag2[t1] - eps & Lag2 < training$Lag2[t1] + eps)
filter(training, Lag2 > training$Lag2[t1 + 1] - eps & Lag2 < training$Lag2[t1 + 1] + eps)
filter(training, Lag2 > training$Lag2[t2] - eps & Lag2 < training$Lag2[t2] + eps)
filter(training, Lag2 > training$Lag2[t2 + 1] - eps & Lag2 < training$Lag2[t2 + 1] + eps)
It is easy to see that these points end up with a Direction different from “their own” because one or more other observations share exactly the same Lag2 value but carry a different label.
Finally, to answer the question raised by Exercise 8:
When the data set contains no duplicated predictor values, knn with \(K = 1\) has 100% accuracy on its own training set. When duplicated values do exist, whether a point becomes one of these misclassified points depends on whether the duplicates share the same label (here, Direction); the more such conflicting duplicates, the lower the training accuracy.
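This explanation can be checked on the whole training set by counting observations whose Lag2 value also occurs with the opposite label; the 41 misclassified points must be a subset of these. A sketch, assuming the training data frame defined above:
# for every observation, count how many distinct Direction labels occur
# among the training points sharing exactly the same Lag2 value
n_labels <- ave(as.integer(training$Direction), training$Lag2,
                FUN = function(z) length(unique(z)))
# observations whose Lag2 value appears with conflicting labels
sum(n_labels > 1)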
Logistic regression in (d) and LDA in (e) give the best results (accuracy 62.5%) on this data.
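For reference, the four test-set accuracies from (d) through (g) can be gathered in one place; a sketch assuming the objects res10d, pred10e, pred10f and f10g created above:
c(logistic = mean(res10d == testing$Direction),
  lda      = mean(pred10e$class == testing$Direction),
  qda      = mean(pred10f$class == testing$Direction),
  knn1     = mean(f10g == testing$Direction))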
auto11 <- Auto
auto11$mpg01 <- 0
auto11$mpg01[Auto$mpg > median(Auto$mpg)] <- 1
plot(auto11)
par(mfrow=c(3,3))
boxplot(cylinders ~ mpg01, data = auto11)
boxplot(displacement ~ mpg01, data = auto11)
boxplot(horsepower ~ mpg01, data = auto11)
boxplot(weight ~ mpg01, data = auto11)
boxplot(acceleration ~ mpg01, data = auto11)
boxplot(year ~ mpg01, data = auto11)
boxplot(origin ~ mpg01, data = auto11)
par(mfrow = c(1,1))
cor(auto11[, -9]) # calculate cor matrix except *name* column
cylinders, displacement, horsepower and weight have relatively high negative correlations with mpg01 (absolute values are all above 0.65).
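A compact way to see this is to sort the absolute correlations with mpg01; a sketch assuming the auto11 data frame built above (column 9 is name):
# correlation of every numeric column with mpg01, strongest first
sort(abs(cor(auto11[, -9])[, 'mpg01']), decreasing = TRUE)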
isTraining <- sample(nrow(auto11), nrow(auto11) * 0.8)
training <- auto11[isTraining, ]
testing <- auto11[-isTraining, ]
f11d <- lda(mpg01 ~ cylinders + displacement + horsepower + weight, data = training)
pred11d <- predict(f11d, testing)
mean(pred11d$class == testing$mpg01)
The test error is \(1 - 0.873 = 0.127\).
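The test error can also be computed directly instead of as one minus the accuracy; a one-line sketch using the prediction object from above:
# proportion of test observations the LDA model gets wrong
mean(pred11d$class != testing$mpg01)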
f11e <- qda(mpg01 ~ cylinders + displacement + horsepower + weight, data = training)
pred11e <- predict(f11e, testing)
1 - mean(pred11e$class == testing$mpg01)
f11f <- glm(mpg01 ~ cylinders + displacement + horsepower + weight, data = training, family = 'binomial')
pred11f <- predict(f11f, testing, type = 'response')
res11f <- rep(0, length(pred11f))
res11f[pred11f > 0.5] = 1
1 - mean(res11f == testing$mpg01)
set.seed(1)
train11g <- auto11[isTraining, c('cylinders', 'displacement', 'horsepower', 'weight')]
test11g <- auto11[-isTraining, c('cylinders', 'displacement', 'horsepower', 'weight')]
trainRes <- auto11$mpg01[isTraining]
f11g1 <- knn(train11g, test11g, trainRes, k = 1)
mean(f11g1 == auto11$mpg01[-isTraining])
f11g3 <- knn(train11g, test11g, trainRes, k = 3)
mean(f11g3 == auto11$mpg01[-isTraining])
f11g5 <- knn(train11g, test11g, trainRes, k = 5)
mean(f11g5 == auto11$mpg01[-isTraining])
f11g100 <- knn(train11g, test11g, trainRes, k = 100)
mean(f11g100 == auto11$mpg01[-isTraining])
The minimum error rate is \(1 - 0.937 = 0.063\). \(k = 5\) performs best on this data set.
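Rather than trying a few values of k by hand, the comparison in (g) can be wrapped in a short loop; a sketch reusing train11g, test11g and trainRes from above (the candidate k values are an arbitrary choice):
set.seed(1)  # knn() breaks ties at random
ks <- c(1, 3, 5, 10, 20, 50, 100)
acc <- sapply(ks, function(k) {
  pred <- knn(train11g, test11g, trainRes, k = k)
  mean(pred == auto11$mpg01[-isTraining])   # test-set accuracy for this k
})
names(acc) <- paste0('k=', ks)
acc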
Power <- function() { 2 ^ 3 }
Power()
Power2 <- function(x, a) { x ^ a }
Power2(3, 8)
Power2(10, 3)
Power2(8, 17)
Power2(131, 3)
In the current version of R, the value of the last evaluated expression is the function's return value, so it is unnecessary to call return() explicitly:
version
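A minimal illustration of this point (Power3 is just a hypothetical name for the explicit-return variant):
Power3 <- function(x, a) {
  result <- x ^ a
  return(result)    # explicit return(); same behaviour as Power2 above
}
Power2(3, 8)
Power3(3, 8)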
par(mfrow=c(1,2))
x <- 1:10
plot(x, Power2(x, 2), main = 'f(x) = x^2', xlab = 'x = 1:10', ylab = 'x^2')
plot(log(x), log(Power2(x, 2)), main = 'log(f(x))', xlab = 'log(x)', ylab = 'log(x^2) = 2 * log(x)')
PlotPower <- function(x, p) {
plot(x, x ^ p, main = 'power p of x')
}
PlotPower(1:10, 3)
Extract the training and test data, then predict with logistic regression using the first group of predictors:
boston13 <- Boston
boston13$crime.rate <- 0
boston13$crime.rate[Boston$crim > median(Boston$crim)] <- 1
isTraining <- sample(nrow(boston13), 0.8 * nrow(boston13))
training <- boston13[isTraining, ]
testing <- boston13[-isTraining, ]
lr.fit1 <- glm(crime.rate ~ zn + indus + chas + nox, data = training, family = 'binomial')
lr.pred1 <- predict(lr.fit1, testing, type = 'response')
lr.res1 <- rep(0, length(lr.pred1))
lr.res1[lr.pred1 > 0.5] <- 1
mean(lr.res1 == testing$crime.rate)
Using another group of predictors:
lr.fit2 <- glm(crime.rate ~ rm + age + dis + rad, data = training, family = 'binomial')
lr.pred2 <- predict(lr.fit2, testing, type = 'response')
lr.res2 <- rep(0, length(lr.pred2))
lr.res2[lr.pred2 > 0.5] <- 1
mean(lr.res2 == testing$crime.rate)
Predict on the 1st group predictors with LDA:
lda.fit1 <- lda(crime.rate ~ zn + indus + chas + nox, data = training)
lda.pred1 <- predict(lda.fit1, testing)
mean(lda.pred1$class == testing$crime.rate)
Predict on the 2nd group predictors with LDA:
lda.fit2 <- lda(crime.rate ~ rm + age + dis + rad, data = training)
lda.pred2 <- predict(lda.fit2, testing)
mean(lda.pred2$class == testing$crime.rate)
Predict on the 1st group predictors with KNN (k = 5):
ktrain1 <- boston13[isTraining, c('zn', 'indus', 'chas', 'nox')]
ktest1 <- boston13[-isTraining, c('zn', 'indus', 'chas', 'nox')]
ktrain.result <- boston13$crime.rate[isTraining]
kmod1 <- knn(ktrain1, ktest1, ktrain.result, k = 5)
mean(kmod1 == boston13$crime.rate[-isTraining])
Predict on the 2nd group predictors with KNN (k = 5):
ktrain2 <- boston13[isTraining, c('rm', 'age', 'dis', 'rad')]
ktest2 <- boston13[-isTraining, c('rm', 'age', 'dis', 'rad')]
kmod2 <- knn(ktrain2, ktest2, ktrain.result, k = 5)
mean(kmod2 == boston13$crime.rate[-isTraining])
KNN (k=5) based on the first group of predictors has the highest accuracy.
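To keep the comparison in one place, the six test-set accuracies can be collected into a named vector; a sketch assuming the objects created above:
c(logistic1 = mean(lr.res1 == testing$crime.rate),
  logistic2 = mean(lr.res2 == testing$crime.rate),
  lda1      = mean(lda.pred1$class == testing$crime.rate),
  lda2      = mean(lda.pred2$class == testing$crime.rate),
  knn1      = mean(kmod1 == boston13$crime.rate[-isTraining]),
  knn2      = mean(kmod2 == boston13$crime.rate[-isTraining]))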