library(ISLR)
attach(Weekly)
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
head(Weekly)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down
pairs(Weekly)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
The pairs plot suggests a relationship between Volume and Year, and the correlation matrix confirms it with a correlation of roughly 0.84. No other clear patterns emerge from the data.
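As an optional extra check (not part of the original output), a simple scatterplot makes the Volume trend over time easy to see:
#plot trading volume against year to visualize the relationship noted above
plot(Year, Volume, main='Weekly trading volume by year')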
(b) Use the full data set to perform a logistic regression with direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
glm.weekly=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly, family = binomial)
summary(glm.weekly)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Lag2 is the only predictor that appears statistically significant, with a p-value of 0.0296.
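If desired, the p-values can also be pulled directly from the coefficient table; a minimal sketch, not part of the original output:
#extract the Pr(>|z|) column from the fitted model's coefficient matrix
coef(summary(glm.weekly))[, "Pr(>|z|)"]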
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
Create a vector of predicted probabilities using the predict() function.
glm.probs=predict(glm.weekly,type='response')
#this will print the first 10 obs and their probability
glm.probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 0.5786097 0.5151972
## 9 10
## 0.5715200 0.5554287
contrasts(Direction)
## Up
## Down 0
## Up 1
To convert the predicted probabilities to class labels, first use rep() to create a vector of 1,089 "Down" labels, one for each observation.
glm.pred=rep("Down", 1089)
Reassign the label to "Up" for observations whose predicted probability exceeds 0.5.
glm.pred[glm.probs>0.5]="Up"
glm.pred[1:10]
## [1] "Up" "Up" "Up" "Down" "Up" "Up" "Up" "Up" "Up" "Up"
# call the prediction table and compare the predictive accuracy
table(glm.pred, Direction)
## Direction
## glm.pred Down Up
## Down 54 48
## Up 430 557
#accuracy
mean(glm.pred==Direction)
## [1] 0.5610652
#misclassification
mean(glm.pred!=Direction)
## [1] 0.4389348
The confusion matrix gives the accuracy and misclassification rates, and it also shows where the mistakes occur: the model predicts "Up" for nearly every week, so it is right for most weeks when the market actually rises (557 of 605 Up weeks) but wrong for the vast majority of weeks when the market falls (430 of 484 Down weeks are predicted as Up). Accuracy Rate = 56%.
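A minimal sketch (not part of the original output) that breaks the confusion matrix into the two error types; conf is just a local copy of the table above:
conf=table(glm.pred, Direction)
#fraction of true Down weeks predicted as Up (false "Up" calls): 430/484
conf['Up','Down']/sum(conf[,'Down'])
#fraction of true Up weeks predicted as Up: 557/605
conf['Up','Up']/sum(conf[,'Up'])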
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
#Create a logical vector that is TRUE for observations with Year before 2009 (the training period)
train=(Year<2009)
#Create the held-out test set by keeping only the rows where train is FALSE; !train selects the 2009-2010 observations
Weekly.09_10=Weekly[!train ,]
#Store the outcome variable for the held-out test observations
Direction.09_10= Direction[!train]
dim(Weekly)
## [1] 1089 9
dim(Weekly.09_10)
## [1] 104 9
head(Weekly.09_10)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 986 2009 6.760 -1.698 0.926 0.418 -2.251 3.793110 -4.448 Down
## 987 2009 -4.448 6.760 -1.698 0.926 0.418 5.043904 -4.518 Down
## 988 2009 -4.518 -4.448 6.760 -1.698 0.926 5.948758 -2.137 Down
## 989 2009 -2.137 -4.518 -4.448 6.760 -1.698 6.129763 -0.730 Down
## 990 2009 -0.730 -2.137 -4.518 -4.448 6.760 5.602004 5.173 Up
## 991 2009 5.173 -0.730 -2.137 -4.518 -4.448 6.217632 -4.808 Down
head(Direction.09_10)
## [1] Down Down Down Down Up Down
## Levels: Down Up
#create model using train dataset
glm.fits=glm(Direction~Lag2, data=Weekly,family=binomial ,subset=train)
summary(glm.fits)
##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly,
## subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
#create a vector to store probabilities of the test data set, using the model from the training data set.
glm.probs=predict(glm.fits,Weekly.09_10, type="response")
To convert the predicted probabilities to class labels, use rep() to create a vector of 104 "Down" labels, one for each held-out observation.
glm.pred=rep("Down",104)
#Reassign the label to "Up" for observations whose predicted probability exceeds 0.5
glm.pred[glm.probs >.5]="Up"
table(glm.pred,Direction.09_10)
## Direction.09_10
## glm.pred Down Up
## Down 9 5
## Up 34 56
#test set error rate
mean(glm.pred!=Direction.09_10)
## [1] 0.375
#test set accuracy rate
mean(glm.pred==Direction.09_10)
## [1] 0.625
Accuracy Rate= 62.5%
(e) Repeat (d) using LDA.
The MASS library must be loaded to use the lda() function. There is no family argument because LDA is its own class of model.
library(MASS)
Create an object called lda.fit and fit the LDA model.
lda.fit=lda(Direction~Lag2, data=Weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
Use the predict() function to run the LDA fit against the held-out test data and return the predictions.
lda.pred=predict(lda.fit, Weekly.09_10)
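The object returned by predict() for an LDA fit is a list; a quick sketch (not part of the original output) of its components:
names(lda.pred)
#per-class posterior probabilities for the first few test weeks
head(lda.pred$posterior)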
Create an object to store the class element of the LDA predictions; this is what goes into the confusion matrix.
lda.class=lda.pred$class
table(lda.class,Direction.09_10)
## Direction.09_10
## lda.class Down Up
## Down 9 5
## Up 34 56
#accuracy rate
mean(lda.class == Direction.09_10)
## [1] 0.625
#misclassification rate
mean(lda.class != Direction.09_10)
## [1] 0.375
Accuracy Rate = 62.5%
(f) Repeat (d) using QDA.
qda() does not report discriminant coefficients because the QDA decision boundary is quadratic rather than linear in the predictors. With this model, prediction is the goal; inference matters less.
qda.fit=qda(Direction~Lag2,data=Weekly, subset=train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.class=predict(qda.fit, Weekly.09_10)$class
qda.class
## [1] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
## [26] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
## [51] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
## [76] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
## [101] Up Up Up Up
## Levels: Down Up
table(qda.class, Direction.09_10)
## Direction.09_10
## qda.class Down Up
## Down 0 0
## Up 43 61
#accuracy rate
mean(qda.class == Direction.09_10)
## [1] 0.5865385
#misclassification rate
mean(qda.class != Direction.09_10)
## [1] 0.4134615
Accuracy Rate = 58.7%. Note that QDA predicts "Up" for every week in the test period, so this accuracy is simply the proportion of Up weeks in 2009-2010.
(g) Repeat (d) using KNN with K = 1.
knn() takes four arguments: 1. a matrix of predictors for the training data, 2. a matrix of predictors for the test data, 3. a vector of class labels for the training observations, and 4. a value of K, the number of nearest neighbors. Here as.matrix() turns Lag2 into one-column matrices for the training and test periods.
library(class)
# one-column matrix of Lag2 for the training period
train.X <- as.matrix(Lag2[train])
# one-column matrix of Lag2 for the test period
test.X <- as.matrix(Lag2[!train])
#create Direction info for train set
train.Direction=Direction[train]
dim(train.X)
## [1] 985 1
dim(test.X)
## [1] 104 1
length(train.Direction)
## [1] 985
Fit the KNN classifier; a seed must be set so the results are reproducible, since ties among nearest neighbors are broken at random.
set.seed(1)
knn.pred=knn(train.X, test.X, train.Direction, k=1)
# show results for prediction vs test data set
table(knn.pred, Direction.09_10)
## Direction.09_10
## knn.pred Down Up
## Down 21 30
## Up 22 31
#accuracy rate
mean(knn.pred == Direction.09_10)
## [1] 0.5
#misclassification rate
mean(knn.pred != Direction.09_10)
## [1] 0.5
Accuracy Rate = 50%
(h) Which of these methods appears to provide the best results on this data?
The logistic regression and LDA models provided the best results, each with an accuracy rate of 62.5%.
(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
# LDA Trial
lda.fit=lda(Direction~Lag2 + I(Lag1^2), data=Weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2 + I(Lag1^2), data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2 I(Lag1^2)
## Down -0.03568254 4.964397
## Up 0.26036581 5.318940
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.45006708
## I(Lag1^2) 0.02767883
#This says to take the LDA fit and run it against the test data
lda.pred=predict(lda.fit, Weekly.09_10)
#create a lda class object to store the class element from the lda predictor. This will allow me to make the confusion matrix
lda.class=lda.pred$class
table(lda.class,Direction.09_10)
## Direction.09_10
## lda.class Down Up
## Down 8 2
## Up 35 59
#accuracy Rate/error
mean(lda.class == Direction.09_10)
## [1] 0.6442308
#misclassification rate
mean(lda.class != Direction.09_10)
## [1] 0.3557692
A model that incorporates Lag2 and a transformation of Lag1 produces the best accuracy, at 64.4%. Interestingly, the KNN model came close to the earlier 62.5% baseline with K = 4, reaching 61.5% accuracy at that value.
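A hedged sketch of the KNN check mentioned above, reusing train.X, test.X, and train.Direction from part (g); the exact accuracy can shift slightly with the seed because ties are broken at random:
set.seed(1)
knn.pred4=knn(train.X, test.X, train.Direction, k=4)
#accuracy with K=4, reported above as roughly 61.5%
mean(knn.pred4 == Direction.09_10)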
attach(Auto)
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and other Auto variables.
mpg01<-ifelse(Auto$mpg > median(Auto$mpg), c(1), c(0))
Auto2<- data.frame(Auto, mpg01)
summary(Auto2)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
## mpg01
## Min. :0.0
## 1st Qu.:0.0
## Median :0.5
## Mean :0.5
## 3rd Qu.:1.0
## Max. :1.0
##
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
attach(Auto2)
pairs(Auto2[,-9])
par(mfrow=c(2,2))
boxplot(mpg~cylinders)
plot(mpg~horsepower)
plot((mpg~acceleration))
Horsepower, cylinders, weight, and displacement look like the most helpful predictors based on the boxplot and pairs outputs; a relationship is visible between those predictors and the mpg01 outcome variable.
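A few additional boxplots against the binary response (a sketch, not part of the original output) make these associations clearer:
par(mfrow=c(2,2))
boxplot(horsepower~mpg01, data=Auto2, main='horsepower by mpg01')
boxplot(weight~mpg01, data=Auto2, main='weight by mpg01')
boxplot(displacement~mpg01, data=Auto2, main='displacement by mpg01')
boxplot(cylinders~mpg01, data=Auto2, main='cylinders by mpg01')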
(c) Split the data into a training set and a test set.
# Create a 70/30 train/test
set.seed(1)
#trainid holds the row indices sampled for the training set
trainid <- sample(1:nrow(Auto2), nrow(Auto2)*0.7 , replace=F) # 70% train, 30% test
train <- Auto2[trainid,]
test <- Auto2[-trainid,]
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
library(MASS)
lda.fit=lda(mpg01~horsepower+cylinders+ weight+displacement,data=train)
lda.fit
## Call:
## lda(mpg01 ~ horsepower + cylinders + weight + displacement, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4927007 0.5072993
##
## Group means:
## horsepower cylinders weight displacement
## 0 129.13333 6.777778 3611.052 271.9333
## 1 79.27338 4.187050 2342.165 116.8129
##
## Coefficients of linear discriminants:
## LD1
## horsepower 0.0061919395
## cylinders -0.3962357999
## weight -0.0008321338
## displacement -0.0047630097
lda.pred=predict(lda.fit, test)
lda.class=lda.pred$class
table(lda.class,test$mpg01)
##
## lda.class 0 1
## 0 50 3
## 1 11 54
#accuracy Rate
mean(lda.class == test$mpg01)
## [1] 0.8813559
#misclassification rate/Error
mean(lda.class != test$mpg01)
## [1] 0.1186441
Test Error: 11.8%
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
qda.fit=qda(mpg01~horsepower+cylinders+ weight+displacement,data=train)
qda.fit
## Call:
## qda(mpg01 ~ horsepower + cylinders + weight + displacement, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4927007 0.5072993
##
## Group means:
## horsepower cylinders weight displacement
## 0 129.13333 6.777778 3611.052 271.9333
## 1 79.27338 4.187050 2342.165 116.8129
qda.class=predict(qda.fit, test)$class
qda.class
## [1] 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 1 0 1 1 0 1
## [38] 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1
## [75] 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 0
## [112] 0 1 1 1 0 1 1
## Levels: 0 1
table(qda.class,test$mpg01)
##
## qda.class 0 1
## 0 52 5
## 1 9 52
#accuracy Rate
mean(qda.class == test$mpg01)
## [1] 0.8813559
#misclassification rate/Error Rate
mean(qda.class != test$mpg01)
## [1] 0.1186441
Test Error Rate: 11.8%
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
#creates model from the training data
glm.fit2=glm(mpg01~displacement+horsepower+weight+cylinders, data=train, family=binomial)
#tests the model on the test set
glm.probs=predict(glm.fit2, test, type= 'response')
#head(glm.probs)
glm.pred2=ifelse(glm.probs > 0.5, '1', '0')
table(glm.pred2, test$mpg01)
##
## glm.pred2 0 1
## 0 53 3
## 1 8 54
#accuracy Rate
mean(glm.pred2== test$mpg01)
## [1] 0.9067797
#misclassification rate/Error Rate
mean(glm.pred2!= test$mpg01)
## [1] 0.09322034
Test Error Rate: 9.3%
(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
knn() takes four arguments: 1. a matrix of predictors for the training data, 2. a matrix of predictors for the test data, 3. a vector of class labels for the training observations, and 4. a value of K, the number of nearest neighbors. Here cbind() binds displacement, horsepower, weight, and cylinders into matrices for the training and test sets.
set.seed(1)
#Re-create the 70/30 split: trainid holds the row indices sampled for the training set
trainid <- sample(1:nrow(Auto2), nrow(Auto2)*0.7 , replace=F) # 70% train, 30% test
train <- Auto2[trainid,]
test <- Auto2[-trainid,]
library(class)
#Create matrix of predictors for training set
train.x=cbind(train$displacement,train$horsepower,train$weight,train$cylinders)
#Create matrix of predictors for testing set
test.x=cbind(test$displacement,test$horsepower,test$weight,test$cylinders)
#create vector of classes for training observations
train.mpg01=train$mpg01
#Check dimensions
dim(train.x)
## [1] 274 4
dim(test.x)
## [1] 118 4
length(train.mpg01)
## [1] 274
Fit the KNN classifier; set a seed so the results are reproducible, since ties among nearest neighbors are broken at random.
set.seed(1)
knn.pred=knn(train.x, test.x, train.mpg01, k=3)
# show results for prediction vs test data set
table(knn.pred, test$mpg01)
##
## knn.pred 0 1
## 0 52 4
## 1 9 53
#accuracy Rate
mean(knn.pred == test$mpg01)
## [1] 0.8898305
#misclassification rate/Error Rate
mean(knn.pred != test$mpg01)
## [1] 0.1101695
The best value of K seems to be 3, producing an error rate of about 11%.
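A sketch of the kind of K search behind this conclusion, reusing train.x, test.x, and train.mpg01 from above; exact error rates depend on the seed because ties are broken at random:
set.seed(1)
for (k in 1:10) {
  pred=knn(train.x, test.x, train.mpg01, k=k)
  cat('K =', k, 'test error =', round(mean(pred != test$mpg01), 4), '\n')
}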
attach(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
#Create a new binary variable indicating whether the crime rate is above the median
crime01<-ifelse(Boston$crim>median(Boston$crim), c(1), c(0))
bostoncrime<-data.frame(Boston, crime01)
summary(bostoncrime$crime01)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.5 0.5 1.0 1.0
cor(bostoncrime)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## crime01 0.40939545 -0.43615103 0.60326017 0.070096774 0.72323480
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## crime01 -0.15637178 0.61393992 -0.61634164 0.619786249 0.60874128 0.2535684
## black lstat medv crime01
## crim -0.38506394 0.4556215 -0.3883046 0.40939545
## zn 0.17552032 -0.4129946 0.3604453 -0.43615103
## indus -0.35697654 0.6037997 -0.4837252 0.60326017
## chas 0.04878848 -0.0539293 0.1752602 0.07009677
## nox -0.38005064 0.5908789 -0.4273208 0.72323480
## rm 0.12806864 -0.6138083 0.6953599 -0.15637178
## age -0.27353398 0.6023385 -0.3769546 0.61393992
## dis 0.29151167 -0.4969958 0.2499287 -0.61634164
## rad -0.44441282 0.4886763 -0.3816262 0.61978625
## tax -0.44180801 0.5439934 -0.4685359 0.60874128
## ptratio -0.17738330 0.3740443 -0.5077867 0.25356836
## black 1.00000000 -0.3660869 0.3334608 -0.35121093
## lstat -0.36608690 1.0000000 -0.7376627 0.45326273
## medv 0.33346082 -0.7376627 1.0000000 -0.26301673
## crime01 -0.35121093 0.4532627 -0.2630167 1.00000000
pairs(bostoncrime)
par(mfrow=c(2,2))
plot(crim~tax)
plot(crim~nox)
plot(crim~dis)
plot(crim~age)
par(mfrow=c(2,2))
boxplot(crim~chas)
boxplot((crim~rad))
set.seed(1)
#Create a 70/30 train/test split: trainid holds the row indices sampled for the training set
trainid <- sample(1:nrow(bostoncrime), nrow(bostoncrime)*0.7 , replace=F) # 70% train, 30% test
train <- bostoncrime[trainid,]
test <- bostoncrime[-trainid,]
#Create logistic model
logistic.fit<-glm(crime01~tax + I(rad^2) + nox, data=train, family = binomial)
summary(logistic.fit)
##
## Call:
## glm(formula = crime01 ~ tax + I(rad^2) + nox, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.87652 -0.32888 -0.06167 0.00000 2.44944
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -17.727046 2.315947 -7.654 1.94e-14 ***
## tax -0.007050 0.002404 -2.933 0.00336 **
## I(rad^2) 0.058470 0.012196 4.794 1.63e-06 ***
## nox 34.103945 4.684851 7.280 3.35e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.65 on 353 degrees of freedom
## Residual deviance: 176.79 on 350 degrees of freedom
## AIC: 184.79
##
## Number of Fisher Scoring iterations: 9
#Create a probability vector to store the predicted probabilities of the outcome variable
logistic.probs<-predict(logistic.fit, test, type='response')
#Create binary predictions based on the probabilities
logistic.pred=ifelse(logistic.probs > 0.5, '1', '0')
#create confusion matrix
table(logistic.pred, test$crime01)
##
## logistic.pred 0 1
## 0 69 14
## 1 4 65
#test set accuracy rate
mean(logistic.pred==test$crime01)
## [1] 0.8815789
#Test Misclassification/Error Rate
mean(logistic.pred!=test$crime01)
## [1] 0.1184211
After starting with tax, rad, dis, age, nox, chas, and indus, the logistic output showed that tax, rad, dis, and nox were the only significant predictors, so a new model was fit. dis was ultimately dropped and a squared transformation was applied to rad. The final logistic model's test error is 11.8%.
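For reference, a sketch of the fuller starting model described above (a reconstruction, not part of the original output); its summary is what motivated dropping the non-significant terms:
full.fit=glm(crime01~tax+rad+dis+age+nox+chas+indus, data=train, family=binomial)
summary(full.fit)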
#Create LDA Model
lda.fit=lda(crime01 ~ I(rad^2) + nox ,data=train)
lda.fit
## Call:
## lda(crime01 ~ I(rad^2) + nox, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.5084746 0.4915254
##
## Group means:
## I(rad^2) nox
## 0 19.92778 0.4695150
## 1 311.44828 0.6377989
##
## Coefficients of linear discriminants:
## LD1
## I(rad^2) 0.002269416
## nox 9.548628235
lda.pred=predict(lda.fit, test)
lda.class=lda.pred$class
table(lda.class,test$crime01)
##
## lda.class 0 1
## 0 72 20
## 1 1 59
#accuracy Rate
mean(lda.class == test$crime01)
## [1] 0.8618421
#misclassification rate/Error
mean(lda.class != test$crime01)
## [1] 0.1381579
After swapping variables in and out, rad (squared) and nox appear to be the best predictors for the LDA model. The LDA test error is 13.8%.
#Create a QDA model
qda.fit=qda(crime01 ~ dis + I(rad^2) + nox ,data=train)
qda.fit
## Call:
## qda(crime01 ~ dis + I(rad^2) + nox, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.5084746 0.4915254
##
## Group means:
## dis I(rad^2) nox
## 0 5.070315 19.92778 0.4695150
## 1 2.537681 311.44828 0.6377989
qda.class=predict(qda.fit, test)$class
qda.class
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0
## [75] 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
## [149] 0 0 0 0
## Levels: 0 1
table(qda.class,test$crime01)
##
## qda.class 0 1
## 0 73 20
## 1 0 59
#accuracy Rate
mean(qda.class == test$crime01)
## [1] 0.8684211
#misclassification rate/Error Rate
mean(qda.class != test$crime01)
## [1] 0.1315789
It is interesting to note that the QDA model prefers dis over tax. The resulting test error rate is 13.2%.
# Create a knn model
#create a matrix of predictors for the training and test set
train.x<-cbind(train$dis, train$rad, train$nox, train$tax)
test.x<-cbind(test$dis, test$rad, test$nox, test$tax)
#create a vector of classes for training observations
train.crim01<-train$crime01
#Check dimensions
dim(train.x)
## [1] 354 4
dim(test.x)
## [1] 152   4
length(train.crim01)
## [1] 354
set.seed(1)
knn.pred=knn(train.x, test.x, train.crim01, k=1)
# show results for prediction vs test data set
table(knn.pred, test$crime01)
##
## knn.pred 0 1
## 0 68 2
## 1 5 77
#accuracy Rate
mean(knn.pred == test$crime01)
## [1] 0.9539474
#misclassification rate/Error Rate
mean(knn.pred != test$crime01)
## [1] 0.04605263
After fitting various KNN models, K = 1 was found to be the best fit, producing an error rate of 4.6%. This model used dis, rad, nox, and tax.
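A compact sketch of the K comparison referred to above, reusing train.x, test.x, and train.crim01; exact rates depend on the seed because ties are broken at random:
set.seed(1)
#test error for a few candidate values of K
sapply(c(1, 3, 5, 10), function(k) mean(knn(train.x, test.x, train.crim01, k=k) != test$crime01))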