Notebook prepared by Everton Lima
\(p(x) = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}\) (4.2)
\(\frac{p(x)}{1-p(x)} = e^{\beta_0+\beta_1x}\) (4.3)
The equations are equivalent since,
\(1-p(x) = 1-\frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}} = \frac{1+e^{\beta_0+\beta_1x}-e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}} = \frac{1}{1+e^{\beta_0+\beta_1x}}\)
\(\frac{1}{1-p(x)} = 1+e^{\beta_0+\beta_1x}\) (1)
and substituting (1) into 4.2 yields 4.3,
\(\frac{p(x)}{1-p(x)} = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}(1+e^{\beta_0+\beta_1x})=e^{\beta_0+\beta_1x}\)
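A quick numerical check with arbitrary, made-up coefficients confirms that the two forms agree:

# Numerical check of (4.2) vs (4.3) with hypothetical coefficients b0 and b1.
b0 <- -1; b1 <- 0.5; x <- seq(-3, 3, by = 0.5)
p <- exp(b0 + b1*x) / (1 + exp(b0 + b1*x))  # Eq. 4.2
all.equal(p / (1 - p), exp(b0 + b1*x))      # odds, Eq. 4.3; TRUE up to rounding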
\(p_k(x) = \frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2) }{\sum^K_{l=1} \pi_l\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2)}\) (4.12)
\(\delta_k(x)=x\cdot\frac{\mu_k}{\sigma^2} -\frac{\mu_k^2}{2\sigma^2} + log(\pi_k)\) (4.13)
\(log(p_k(x)) = log(\frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2) }{\sum^K_{l=1} \pi_l\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2)})=log(\pi_k\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2) ) - log(\sum^K_{l=1} \pi_l\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2))\) (Eq. 1)
Let \(C = log(\sum^K_{l=1} \pi_l\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2))\), a constant that does not depend on k. Then Eq. 1 becomes \(log(p_k(x)) = log(\pi_k\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2) ) - C\)
Further simplifications can be made,
\(log(p_k(x)) = log(\pi_k)+log(\frac{1}{\sqrt{2\pi}\sigma})+log(exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2)) - C = -\frac{1}{2\sigma^2}(x-\mu_k)^2 + log(\pi_k)+log(\frac{1}{\sqrt{2\pi}\sigma})-C\)
\(= -\frac{1}{2\sigma^2}(x^2-2x\mu_k+\mu_k^2) + log(\pi_k)+log(\frac{1}{\sqrt{2\pi}\sigma})-C\)
\(= x\cdot\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2} + log(\pi_k)-\frac{x^2}{2\sigma^2}+log(\frac{1}{\sqrt{2\pi}\sigma})-C\)
Since the Bayes classifier assigns an observation to the class for which \(p_k(x)\) is maximized, it assigns an observation to class j if \(p_j(x) > p_i(x)\) for all \(i\neq j\).
\(p_j(x) > p_i(x) \iff log(p_j(x)) > log(p_i(x))\) for all \(i\neq j\), since log is monotonically increasing, and therefore
\(x\cdot\frac{\mu_j}{\sigma^2}-\frac{\mu_j^2}{2\sigma^2} + log(\pi_j)-\frac{x^2}{2\sigma^2}+log(\frac{1}{\sqrt{2\pi}\sigma})-C > x\cdot\frac{\mu_i}{\sigma^2}-\frac{\mu_i^2}{2\sigma^2} + log(\pi_i)-\frac{x^2}{2\sigma^2}+log(\frac{1}{\sqrt{2\pi}\sigma})-C\)
\(x\cdot\frac{\mu_j}{\sigma^2}-\frac{\mu_j^2}{2\sigma^2} + log(\pi_j)> x\cdot\frac{\mu_i}{\sigma^2}-\frac{\mu_i^2}{2\sigma^2} + log(\pi_i)\) for all \(i\neq j\), since \(\sigma_1 = \sigma_2 = ... = \sigma_K = \sigma\) and the remaining terms are therefore identical on both sides.
Thus the Bayes classifier assigns an observation to the class for which the discriminant function
\(\delta_k(x) = x\cdot\frac{\mu_k}{\sigma^2} -\frac{\mu_k^2}{2\sigma^2} + log(\pi_k)\) is maximized.
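As a quick numerical check with made-up parameters (two classes sharing \(\sigma\)), the class that maximizes \(\delta_k(x)\) is also the class that maximizes the posterior \(p_k(x)\):

# Hypothetical two-class example: compare the linear discriminants with the
# normalized posteriors computed directly from the normal densities.
mu <- c(-1, 2); sigma <- 1.5; prior <- c(0.3, 0.7); x <- 0.5
delta <- x * mu / sigma^2 - mu^2 / (2 * sigma^2) + log(prior)
post <- prior * dnorm(x, mean = mu, sd = sigma)
post <- post / sum(post)
which.max(delta) == which.max(post)  # TRUE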
In the previous problem the quadratic term \(\frac{x^2}{2\sigma^2}\) could be eliminated because LDA assumes a common variance, \(\sigma = \sigma_1 = \sigma_2 = ... = \sigma_K\). QDA, on the other hand, makes no such assumption.
Following the steps in the previous question yields
\(x\cdot\frac{\mu_j}{\sigma^2}-\frac{\mu_j^2}{2\sigma^2} + log(\pi_j)-\frac{x^2}{2\sigma^2}+log(\frac{1}{\sqrt{2\pi}\sigma})-C > x\cdot\frac{\mu_i}{\sigma^2}-\frac{\mu_i^2}{2\sigma^2} + log(\pi_i)-\frac{x^2}{2\sigma^2}+log(\frac{1}{\sqrt{2\pi}\sigma})-C\)
where, if no assumption of a common \(\sigma\) is made, an observation is assigned to class j when
\(x\cdot\frac{\mu_j}{\sigma_j^2}-\frac{\mu_j^2}{2\sigma_j^2} + log(\pi_j)-\frac{x^2}{2\sigma_j^2}+log(\frac{1}{\sqrt{2\pi}\sigma_j})> x\cdot\frac{\mu_i}{\sigma_i^2}-\frac{\mu_i^2}{2\sigma_i^2} + log(\pi_i)-\frac{x^2}{2\sigma_i^2}+log(\frac{1}{\sqrt{2\pi}\sigma_i})\) for all \(i \neq j\). Since the \(x^2\) terms no longer cancel, the resulting discriminant is quadratic in \(x\).
If X is a uniformly distributed random variable on [0,1], then \(P(X \leq x) = F(x) = x\) for \(0 \leq x \leq 1\). Thus, for any observation \(x_i\) on [0,1], the probability that a value falls in the interval \([x_i-.05,x_i+.05]\) is 10%, since \(P(x_i-.05 \leq X \leq x_i+.05) = \int_{x_i-.05}^{x_i+.05}f(x)dx = F(x_i+.05)-F(x_i-.05) = (x_i+.05)-(x_i-.05) = .1\), or 10%,
assuming that \(x_i \in [0.05,0.95]\).
Similarly to the previous problem, if \(X_1\) and \(X_2\) are uniformly distributed random variables, then there is a 10% chance of an observation falling in the 10% neighborhood of the observed value of \(X_1\), and likewise for \(X_2\).
Assume that \(X_1\) and \(X_2\) are independent, let \(x\) and \(y\) be the observed values of \(X_1\) and \(X_2\), and define \(A := [x-0.05,x+0.05]\) and \(B := [y-0.05,y+0.05]\), so that A and B are the neighborhoods of \(x\) and \(y\) respectively. Then \(P(X_1 \in A, X_2 \in B) = P(X_1 \in A)P(X_2 \in B) = 0.1 \times 0.1 = 0.01\), or 1%.
Thus, on average only 1% of observations are within the 10% neighborhood of both cases.
\(0.1^{100}\)
From the previous questions one can note that if all features are independent and uniformly distributed on [0,1], then with each additional feature the fraction of observations that lie near a given test observation shrinks by a factor of 10. In practice this means that KNN may have to rely on neighbors that are very far from a particular observation, yielding very poor results.
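A small simulation with uniformly distributed features illustrates how quickly this fraction shrinks as p grows (the sample size and the values of p used here are arbitrary):

# Fraction of observations lying within +/- 0.05 of a test point in every one
# of p dimensions (test point fixed at 0.5 to avoid edge effects).
set.seed(1)
sapply(c(1, 2, 5, 10), function(p) {
  X <- matrix(runif(10000 * p), ncol = p)
  mean(apply(abs(X - 0.5) <= 0.05, 1, all))
})  # roughly 0.1, 0.01, and essentially 0 for larger p, as 0.1^p predicts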
Hypercube. To contain on average 10% of the training observations, the hypercube centered on the test observation must have side length \(0.1^{1/p}\), which for p = 100 is about 0.98, i.e. nearly the entire range of each feature.
On the training set QDA will tend to perform better than LDA because of its greater flexibility. However, if the Bayes decision boundary is linear, LDA is expected to perform better on the test data set.
Expect QDA to perform better in both cases.
As the sample size increases, the prediction accuracy of QDA relative to LDA is expected to improve. In general this holds for non-linear methods relative to linear methods: the more flexible method retains its lower bias, while its higher variance, its main drawback for small n, decreases as n increases.
False.
For small values of n, QDA will tend to fit the noise in the training data, so its test error will be higher in this case.
\(p(x) = \frac{exp(\hat{\beta_0}+\hat{\beta_1}x_1+\hat{\beta_2}x_2) }{1+exp(\hat{\beta_0}+\hat{\beta_1}x_1+\hat{\beta_2}x_2)}=\frac{exp(-6+0.05\cdot 40+3.5)}{1+exp(-6+0.05\cdot 40+3.5)}=\frac{exp(-0.5)}{1+exp(-0.5)}\approx .3775\) or about 37.75%
Let \(\hat{Y} = \hat{\beta_0}+\hat{\beta_1} x_1+\hat{\beta_2} x_2 = -6+0.05x_1+3.5\). The logistic regression function can then be expressed as \(p(x) = \frac{ e^{ \hat{Y}} }{ 1 + e^{ \hat{Y}} }\), and if the probability that the student gets an A in the class is .5 then \(\frac{ e^{ \hat{Y}} }{ 1+e^{ \hat{Y}} } = .5\)
so,
\(e^{\hat{Y}} = .5(1+e^{\hat{Y}}) = .5+.5e^{\hat{Y}}\)
\(e^{\hat{Y}} - .5e^{\hat{Y}}= .5\)
\(e^{\hat{Y}}(1 - .5)= .5\)
\(.5e^{\hat{Y}}= .5\)
\(log(.5)+\hat{Y} = log(.5)\)
\(\hat{Y} = -6+0.05x_1+3.5 = 0\)
\(x_1 = 50\)
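Both parts can be checked numerically using the coefficient values given in the exercise:

# Part (a): probability of an A for 40 hours studied and a GPA of 3.5.
# Part (b): hours of study needed for a 50% chance, i.e. where the log-odds are zero.
b <- c(-6, 0.05, 1)               # beta0, beta1 (hours), beta2 (GPA)
eta <- b[1] + b[2]*40 + b[3]*3.5  # log-odds
exp(eta) / (1 + exp(eta))         # ~0.3775
(0 - b[1] - b[3]*3.5) / b[2]      # 50 hours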
The normal density function is \(f(x) = \frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2)\),
where for the class with \(\mu=0\) (no dividend), \(f(4) = \frac{1}{\sqrt{2\pi}\cdot 6}exp(-\frac{1}{2(6)^2}(4-0)^2) \approx 0.0532\), and for the class with \(\mu=10\) (dividend), \(f(4) = \frac{1}{\sqrt{2\pi}\cdot 6}exp(-\frac{1}{2(6)^2}(4-10)^2) \approx 0.0403\). Plugging these into Bayes' formula with the priors \(\pi_{Yes}=0.8\) and \(\pi_{No}=0.2\) gives \(P(Yes \mid X=4) = \frac{0.8\cdot 0.0403}{0.8\cdot 0.0403+0.2\cdot 0.0532} \approx 0.752\).
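The same calculation in R, using the values given in the exercise (\(\mu_{Yes}=10\), \(\mu_{No}=0\), \(\sigma^2=36\), \(\pi_{Yes}=0.8\)):

# Posterior probability of issuing a dividend given X = 4.
f.yes <- dnorm(4, mean = 10, sd = 6)
f.no <- dnorm(4, mean = 0, sd = 6)
0.8 * f.yes / (0.8 * f.yes + 0.2 * f.no)  # ~0.752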
For K = 1, KNN has a training error of 0%, since the closest point to any training observation is the observation itself. Since the average of the training and test errors is 18%, the test error must be 36%, which is 6 percentage points higher than the 30% test error obtained by logistic regression. This indicates that logistic regression is the better approach.
library("ISLR")
As before, there is one qualitative feature, Direction, while all other features are numerical.
Moreover, the scatter plot matrix of the Weekly data indicates that the Lag variables are only weakly correlated with one another. It can also be seen that the volume of shares traded has been increasing rapidly from year to year.
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202
## Median : 0.2380 Median : 0.2340 Median :1.00268
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821
## Today Direction
## Min. :-18.1950 Down:484
## 1st Qu.: -1.1540 Up :605
## Median : 0.2410
## Mean : 0.1499
## 3rd Qu.: 1.4050
## Max. : 12.0260
plot(Weekly)
As previously observed, the Lag variables are weakly correlated and there is a strong correlation between Year and Volume.
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
About 44% of the weeks are classified as Down, and about 56% as Up.
table(Weekly$Direction)/sum(table(Weekly$Direction))
##
## Down Up
## 0.4444444 0.5555556
It is important to note that Direction is simply a nominal form of the Today feature. This should be the case since Today gives the percentage return for the current week, and Direction simply maps positive and negative returns to ‘Up’ and ‘Down’ respectively.
head(Weekly[,c(8,9)])
## Today Direction
## 1 -0.270 Down
## 2 -2.576 Down
## 3 3.514 Up
## 4 0.712 Up
## 5 1.178 Up
## 6 -1.372 Down
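A quick cross-tabulation confirms this mapping:

# Direction is 'Up' exactly when the week's return (Today) is positive.
with(Weekly, table(Direction, Positive = Today > 0))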
As can be seen below, the only predictor that is statistically significant (at the 5% significance level) is Lag2.
glm.fit <- glm(Direction~.-Year-Today,data=Weekly,family="binomial")
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ . - Year - Today, family = "binomial",
## data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
From the table below it can be observed that the model classifies about 56% of the observations correctly. It performs well on the weeks the market goes up, with about 92% of the ‘Up’ weeks correctly classified. However, it performs poorly on the weeks the market is down, with only about 11% of those weeks classified correctly.
glm.probs <- predict(glm.fit,type = "response")
glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.5] = "Up"
table(glm.pred,Weekly$Direction)
##
## glm.pred Down Up
## Down 54 48
## Up 430 557
mean(glm.pred == Weekly$Direction)
## [1] 0.5610652
557/(557+48)
## [1] 0.9206612
54/(54+430)
## [1] 0.1115702
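These rates can also be read directly off the confusion matrix; conf below is just a name introduced here for the table computed above:

# Overall accuracy and per-class accuracy from the confusion matrix.
conf <- table(glm.pred, Weekly$Direction)
sum(diag(conf)) / sum(conf)                # overall accuracy
conf["Up", "Up"] / sum(conf[, "Up"])       # fraction of 'Up' weeks predicted correctly
conf["Down", "Down"] / sum(conf[, "Down"]) # fraction of 'Down' weeks predicted correctly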
train <- Weekly[,"Year"] <= 2008
glm.fit <- glm(Direction~Lag2,data = Weekly,subset = train, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = Weekly,
## subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
glm.probs <- predict(glm.fit,Weekly[!train,],type = "response")
glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.5] = "Up"
table(glm.pred,Weekly[,"Direction"]) # Confusion Matrix.
##
## glm.pred Down Up
## Down 63 85
## Up 421 520
mean(glm.pred == Weekly[,"Direction"]) # Fraction of correct predictions.
## [1] 0.5353535
library(MASS)
lda.fit <- lda(Direction~Lag2,data=Weekly,subset=train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
lda.pred <- predict(lda.fit,Weekly[!train,])
lda.class <- lda.pred$class
table(lda.class,Weekly[!train,9])
##
## lda.class Down Up
## Down 9 5
## Up 34 56
mean(lda.class == Weekly[!train,9])
## [1] 0.625
qda.fit <- qda(Direction~Lag2,data=Weekly,subset=train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.pred <- predict(qda.fit,Weekly[!train,])
qda.class <- qda.pred$class
table(qda.class,Weekly[!train,9])
##
## qda.class Down Up
## Down 0 0
## Up 43 61
mean(qda.class == Weekly[!train,9])
## [1] 0.5865385
library(class)
train.X <- cbind(Weekly[train,3])
test.X <- cbind(Weekly[!train,3])
train.Direction <- Weekly[train,c(9)]
test.Direction <- Weekly[!train,c(9)]
knn.pred <- knn(train.X,test.X,train.Direction,k=1)
table(knn.pred,test.Direction)
## test.Direction
## knn.pred Down Up
## Down 21 30
## Up 22 31
mean(knn.pred == test.Direction)
## [1] 0.5
The best performing model is LDA, since it has the highest prediction accuracy on the held-out data (62.5%).
# Logistic Regression
# Raise the cutoff for predicting "Up" to 0.55
glm.fit <- glm(Direction~Lag2,data = Weekly,subset = train, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = Weekly,
## subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
glm.probs <- predict(glm.fit,Weekly[!train,],type = "response")
glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.55] = "Up"
table(glm.pred,Weekly[,"Direction"]) # Confusion Matrix.
##
## glm.pred Down Up
## Down 196 266
## Up 288 339
mean(glm.pred == Weekly[,"Direction"]) # Fraction of correct predictions.
## [1] 0.4912764
# Non Linear Transformation
# Up to Cubic power
glm.fit <- glm(Direction~Lag2+I(Lag2^2)+I(Lag2^3),data = Weekly,subset = train, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag2 + I(Lag2^2) + I(Lag2^3), family = "binomial",
## data = Weekly, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.194 -1.245 1.008 1.108 1.142
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1608285 0.0714552 2.251 0.0244 *
## Lag2 0.0491970 0.0340597 1.444 0.1486
## I(Lag2^2) 0.0095243 0.0072076 1.321 0.1864
## I(Lag2^3) 0.0005092 0.0005445 0.935 0.3497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1348.5 on 981 degrees of freedom
## AIC: 1356.5
##
## Number of Fisher Scoring iterations: 4
glm.probs <- predict(glm.fit,Weekly[!train,],type = "response")
glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.5] = "Up"
table(glm.pred,Weekly[,"Direction"]) # Confusion Matrix.
##
## glm.pred Down Up
## Up 484 605
mean(glm.pred == Weekly[,"Direction"]) # Fraction of correct predictions.
## [1] 0.5555556
# Square Root
glm.fit <- glm(Direction~sqrt(abs(Lag2)),data = Weekly,subset = train, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ sqrt(abs(Lag2)), family = "binomial",
## data = Weekly, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.405 -1.263 1.058 1.093 1.136
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.09488 0.15028 0.631 0.528
## sqrt(abs(Lag2)) 0.09961 0.11788 0.845 0.398
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1354.0 on 983 degrees of freedom
## AIC: 1358
##
## Number of Fisher Scoring iterations: 3
glm.probs <- predict(glm.fit,Weekly[!train,],type = "response")
glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.5] = "Up"
table(glm.pred,Weekly[,"Direction"]) # Confusion Matrix.
##
## glm.pred Down Up
## Up 484 605
mean(glm.pred == Weekly[,"Direction"]) # Fraction of correct predictions.
## [1] 0.5555556
# Log
glm.fit <- glm(Direction~log10(abs(Lag2)),data = Weekly,subset = train, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ log10(abs(Lag2)), family = "binomial",
## data = Weekly, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.305 -1.269 1.074 1.088 1.167
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20916 0.06410 3.263 0.0011 **
## log10(abs(Lag2)) 0.06823 0.13149 0.519 0.6038
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1354.4 on 983 degrees of freedom
## AIC: 1358.4
##
## Number of Fisher Scoring iterations: 3
glm.probs <- predict(glm.fit,Weekly[!train,],type = "response")
glm.pred <- rep("Down",nrow(Weekly))
glm.pred[glm.probs>0.5] = "Up"
table(glm.pred,Weekly[,"Direction"]) # Confusion Matrix.
##
## glm.pred Down Up
## Up 484 605
mean(glm.pred == Weekly[,"Direction"]) # Fraction of correct predictions.
## [1] 0.5555556
train.X <- cbind(Weekly[train,3])
test.X <- cbind(Weekly[!train,3])
train.Direction <- Weekly[train,c(9)]
test.Direction <- Weekly[!train,c(9)]
errors <- c()
maxK <- 100
step <- 2
for(j in seq(1,maxK,step)){
knn.pred <- knn(train.X,test.X,train.Direction,k=j)
table(knn.pred,test.Direction)
errors <- c(1-mean(knn.pred == test.Direction),errors) # new errors are prepended, so errors ends up in reverse order of k
}
data <- cbind(seq(1,maxK,step),errors)
plot(data,type="l",xlab="k")
knn.pred <- knn(train.X,test.X,train.Direction,k=which.min(errors)) # which.min returns a position within errors rather than a value of k
table(knn.pred,test.Direction)
## test.Direction
## knn.pred Down Up
## Down 18 24
## Up 25 37
mean(knn.pred == test.Direction)
## [1] 0.5288462
mpg01 <- rep(0,nrow(Auto))
mpg01[Auto[,'mpg']>median(Auto[,'mpg'])] <- 1
mpg01 = as.factor(mpg01)
Data = data.frame(Auto,mpg01)
table(mpg01)
## mpg01
## 0 1
## 196 196
Horsepower, weight, and displacement seem useful in predicting mpg01, as the boxplots and group means below suggest.
par(mfrow = c(2,2))
boxplot(acceleration~mpg01,Data,main="Accel~mpg01")
boxplot(weight~mpg01,Data,main="Weight~mpg01")
boxplot(horsepower~mpg01,Data,main="Horsep~mpg01")
boxplot(displacement~mpg01,Data,main="Disp~mpg01")
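The group means of the candidate predictors, a small additional check, support this impression:

# Mean of each candidate predictor within the low- and high-mpg groups.
aggregate(cbind(acceleration, weight, horsepower, displacement) ~ mpg01,
          data = Data, FUN = mean)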
Data.train = Data[1:196,]
Data.test = Data[196:392,] # note that row 196 also appears in the training set
Using the LDA model with the predictors weight and displacement yields a misclassification error of 8.1%. Adding the remaining predictors does not provide a significant improvement.
lda.fit = lda(mpg01~weight+displacement,Data)
lda.fit
## Call:
## lda(mpg01 ~ weight + displacement, data = Data)
##
## Prior probabilities of groups:
## 0 1
## 0.5 0.5
##
## Group means:
## weight displacement
## 0 3620.403 273.1582
## 1 2334.765 115.6658
##
## Coefficients of linear discriminants:
## LD1
## weight -0.001011194
## displacement -0.006968032
lda.pred = predict(lda.fit, Data.test[,c('horsepower','weight','displacement')] )
table(lda.pred$class,Data.test[,'mpg01'])
##
## 0 1
## 0 56 5
## 1 11 125
mean(lda.pred$class != Data.test[,'mpg01'])
## [1] 0.08121827
The test error is slightly higher in this case: about 11% of the test data is misclassified.
qda.fit = qda(mpg01~horsepower+displacement,Data)
qda.fit
## Call:
## qda(mpg01 ~ horsepower + displacement, data = Data)
##
## Prior probabilities of groups:
## 0 1
## 0.5 0.5
##
## Group means:
## horsepower displacement
## 0 130.11224 273.1582
## 1 78.82653 115.6658
qda.pred = predict(qda.fit, Data.test[,c('horsepower','weight','displacement')] )
table(qda.pred$class,Data.test[,'mpg01'])
##
## 0 1
## 0 57 11
## 1 10 119
mean(qda.pred$class != Data.test[,'mpg01'])
## [1] 0.106599
Logistic regression is the worst model so far, with about 19% of the observations misclassified. Note that predict() is called below without type = "response", so the cutoffs are applied to the linear predictor (log-odds) rather than to probabilities; the 0.5 and 0.25 cutoffs correspond to probability cutoffs of roughly 0.62 and 0.56. Lowering the cutoff improves the model.
logit.fit = glm(mpg01~horsepower+displacement,family = binomial,Data.train)
summary(logit.fit)
##
## Call:
## glm(formula = mpg01 ~ horsepower + displacement, family = binomial,
## data = Data.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.30936 -0.17525 -0.00666 0.34107 2.11244
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.19005 2.03526 4.515 6.32e-06 ***
## horsepower -0.04605 0.02242 -2.054 0.04 *
## displacement -0.03501 0.00745 -4.700 2.60e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 251.761 on 195 degrees of freedom
## Residual deviance: 84.575 on 193 degrees of freedom
## AIC: 90.575
##
## Number of Fisher Scoring iterations: 8
logit.pred = predict(logit.fit,Data.test[,c('horsepower','weight','displacement')])
# cutoff = 0.5 on the log-odds scale
logit.class = ifelse(logit.pred>0.5,1,0)
table(logit.class,Data.test[,'mpg01'])
##
## logit.class 0 1
## 0 65 35
## 1 2 95
mean(logit.class != Data.test[,'mpg01'])
## [1] 0.1878173
# cutoff = 0.25 on the log-odds scale
logit.class = ifelse(logit.pred>0.25,1,0)
table(logit.class,Data.test[,'mpg01'])
##
## logit.class 0 1
## 0 65 31
## 1 2 99
mean(logit.class != Data.test[,'mpg01'])
## [1] 0.1675127
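For reference, the same comparison can be made on the probability scale by passing type = "response" to predict; in this sketch the 0.5 cutoff is a probability:

# Predicted probabilities rather than log-odds, then a 0.5 probability cutoff.
logit.probs = predict(logit.fit, Data.test, type = "response")
logit.class = ifelse(logit.probs > 0.5, 1, 0)
mean(logit.class != Data.test[,'mpg01'])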
The test error lies between roughly 0.13 and 0.17. In this run the smallest error occurs at K = 2 (see which.min below); however, because ties in the vote are broken at random, the selected K varies somewhat between runs.
library('class')
knn.preds = sapply(1:20,function(x){
pred = knn(Data.train[,c('horsepower','weight','displacement')],
Data.test[,c('horsepower','weight','displacement')],
cl = Data.train[,'mpg01'],k=x,l=x/2)
mean(pred != Data.test[,'mpg01'])
})
plot(1/(1:20),knn.preds,type = 'l',xlab='1/K')
summary(knn.preds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1320 0.1510 0.1574 0.1561 0.1624 0.1726
which.min(knn.preds)
## [1] 2
Power = function(){2^3}
Power()
## [1] 8
Power2 = function(x,a){x^a}
Power2(3,8)
## [1] 6561
Power2(10,3)
## [1] 1000
Power2(8,17)
## [1] 2.2518e+15
Power2(131,3)
## [1] 2248091
Power3 = function(x,a){ result=x^a; return(result)}
x = 1:10
y = sapply(x,function(x){Power3(x,2)})
plot(x,y,log='x',main='x ~ x^2',type='l')
PlotPower = function(x,a){
y = sapply(x,function(x){x^a})
plot(x,y,main=paste('x~x^',a,sep = ''),type='l')
}
PlotPower(1:10,3)
library('MASS')
mcrim = median(Boston[,'crim'])
mcrim01 = ifelse(Boston[,'crim']>mcrim,1,0) # create label: 1 if above the median crime rate, 0 otherwise.
table(mcrim01)
## mcrim01
## 0 1
## 253 253
Data = cbind(Boston,mcrim01)
par(mfrow = c(2,2))
boxplot(medv~mcrim01,Data,main='medv~mcrim01')
boxplot(lstat~mcrim01,Data,main='lstat~mcrim01')
boxplot(dis~mcrim01,Data,main='dis~mcrim01')
boxplot(age~mcrim01,Data,main='age~mcrim01')
From the plots above we can observe the following: although there is substantial variation, towns with above-median crime rates tend to have lower median home values. A higher percentage of lower-status population is positively associated with the crime rate, and the same can be said about the proportion of homes built before 1940.
Interestingly, the weighted mean of distances to employment centers (dis) shows a clear negative association with crime.
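As a sketch of how these variables could be used for the classification task itself (the 50/50 split and the choice of predictors here are illustrative, not tuned):

# Hold out half of the towns, fit logistic regression on a few of the
# predictors examined above, and report the test misclassification rate.
set.seed(1)
idx = sample(nrow(Data), nrow(Data)/2)
crim.train = Data[idx,]
crim.test = Data[-idx,]
crim.fit = glm(mcrim01~medv+lstat+dis+age, data = crim.train, family = binomial)
crim.prob = predict(crim.fit, crim.test, type = 'response')
crim.pred = ifelse(crim.prob > 0.5, 1, 0)
mean(crim.pred != crim.test$mcrim01)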