This question should be answered using the Weekly data set, which is part of the ISLR package.
library(ISLR)
attach(Weekly)
dim(Weekly)
## [1] 1089 9
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
pairs(Weekly)
library(corrplot)
## corrplot 0.84 loaded
W <- cor(Weekly[,-9])
corrplot.mixed(W,lower.col='grey',order = "hclust")
From the graphs given above, the correlations between the lag variables and today’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between Year and Volume.
index<-1:nrow(Weekly)
loessMod50 <- loess(Weekly$Volume ~ index, span=0.50)
smoothed50 <- predict(loessMod50)
plot(y=Weekly$Volume,x=index,pch=20,cex=0.1,col='#bdbdbd', main='Volume against Time',ylab="Volume",xlab="Index")
lines(smoothed50, x=index, col="#de2d26",lwd=2)
(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
glm.fits <- glm(Direction ~ Lag1+ Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family=binomial)
summary(glm.fits)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Based of the results, only Lag2 was statistically significant.
The predict() function can be used to predict the probability that the market will go up, given values of the predictors.
glm.probs <- predict(glm.fits, type="response")
glm.probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 0.5786097 0.5151972
## 9 10
## 0.5715200 0.5554287
We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.
contrasts(Direction)
## Up
## Down 0
## Up 1
In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down.
glm.pred=rep("Down" ,1089)
glm.pred[glm.probs >.5]="Up"
(c) Compute the confusion matrix and overall fraction of correct predictions.
table(glm.pred, Direction)
## Direction
## glm.pred Down Up
## Down 54 48
## Up 430 557
Accuracy: (TN+TP)/(TN+FP+FN+TP)=(54 + 557) /1089 = 0.5611 or 56.11%
Error rate: (FP+FN)/(TP+FP+FN+TN) or 1 - accuracy
1 - 0.5611 = 0.4389 or 43.89%
Precision: TP / (FP + TP) = 557 /(48+557) = 92.07%
Sensitivity: TP / (TP+FN) = 557/(557+430) = 56.43%
Specificity: TN / (TN+FP) = 54 /(54+48) = 52.94%
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor.
# training
train <- (Weekly$Year < 2009)
Weekly_train <- Weekly[train,]
Weekly_test <- Weekly[!train,]
Direction_train <- Weekly_train$Direction
Direction_test <- Weekly_test$Direction
# running the model using the training data
logistic_wkly <- glm(Direction ~ Lag2, data = Weekly, family = binomial)
# output
summary(logistic_wkly)
##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.564 -1.267 1.008 1.086 1.386
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.21473 0.06123 3.507 0.000453 ***
## Lag2 0.06279 0.02636 2.382 0.017230 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1490.4 on 1087 degrees of freedom
## AIC: 1494.4
##
## Number of Fisher Scoring iterations: 4
# running the model using the testing data
logistic_probs <- predict(logistic_wkly, Weekly_test, type = "response")
logistic_pred = rep("Down", length(Direction_test))
logistic_pred[logistic_probs > 0.5] <- "Up"
table(logistic_pred, Direction_test)
## Direction_test
## logistic_pred Down Up
## Down 9 5
## Up 34 56
mean(logistic_pred == Direction_test)
## [1] 0.625
Accuracy: (TN+TP)/(TN+FP+FN+TP)= (9 + 56)/(9+5+34+56)= 65/104= 0.625= 62.5%
(e) Repeat (d) using LDA.
library(MASS)
lda.fit <- lda(Direction ~ Lag2, data= Weekly, subset=train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
plot(lda.fit,col='#bcbddc')
# The predict() function returns a list with three elements.
lda.pred <-predict(lda.fit, Weekly_test)
names(lda.pred)
## [1] "class" "posterior" "x"
# confusion matrix
lda.class <-lda.pred$class
table(lda.class, Direction_test)
## Direction_test
## lda.class Down Up
## Down 9 5
## Up 34 56
# mean
mean(lda.class==Direction_test)
## [1] 0.625
The LDA output indicates that pi 1 = 0.448 and pi 2 = 0.552; in other words, 44.8% of the training observations correspond to days during which the market went down.
(f) Repeat (d) using QDA.
qda.fit <- qda(Direction ~ Lag2, data=Weekly, subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
# predict
qda.class=predict(qda.fit, Weekly_test)$class
# confusion matrix
table(qda.class,Direction_test)
## Direction_test
## qda.class Down Up
## Down 0 0
## Up 43 61
# mean
mean(qda.class==Direction_test)
## [1] 0.5865385
Accuracy: (TN+TP)/(TN+FP+FN+TP)= 58.65%
(g) Repeat (d) using KNN with K = 1.
library(class)
train.X <- cbind(Lag2[train])
test.X <- cbind(Lag2[!train])
train.Direction=Direction[train]
# seed must be set in order to ensure reproducibility of results
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Direction, k=1)
table(knn.pred, Direction_test)
## Direction_test
## knn.pred Down Up
## Down 21 30
## Up 22 31
mean(knn.pred==Direction_test)
## [1] 0.5
Accuracy: (TN+TP)/(TN+FP+FN+TP)= 50%
(h) Which of these methods appears to provide the best results on this data?
Logistic
LDA
QDA
KNN
(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods.
glm.fits2 <- glm(Direction ~ Lag1*Lag2, data = Weekly, family=binomial)
summary(glm.fits)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
# running the model using the testing data
logistic_probs2 <- predict(logistic_wkly, Weekly_test, type = "response")
logistic_pred2 <- rep("Down", length(Direction_test))
logistic_pred2[logistic_probs > 0.5] <- "Up"
table(logistic_pred2, Direction_test)
## Direction_test
## logistic_pred2 Down Up
## Down 9 5
## Up 34 56
mean(logistic_pred2 == Direction_test)
## [1] 0.625
lda.fit2 <- lda(Direction ~ Lag1*Lag2, data= Weekly, subset=train)
lda.fit2
## Call:
## lda(Direction ~ Lag1 * Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag1 Lag2 Lag1:Lag2
## Down 0.289444444 -0.03568254 -0.8014495
## Up -0.009213235 0.26036581 -0.1393632
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.285484602
## Lag2 0.295080109
## Lag1:Lag2 0.009629381
plot(lda.fit2,col='#bcbddc')
# The predict() function returns a list with three elements.
lda.pred2 <-predict(lda.fit, Weekly_test)
names(lda.pred2)
## [1] "class" "posterior" "x"
# confusion matrix
lda.class2 <-lda.pred2$class
table(lda.class2, Direction_test)
## Direction_test
## lda.class2 Down Up
## Down 9 5
## Up 34 56
# mean
mean(lda.class2==Direction_test)
## [1] 0.625
qda.fit2 <- qda(Direction ~ Lag1*Lag2, data=Weekly, subset = train)
qda.fit2
## Call:
## qda(Direction ~ Lag1 * Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag1 Lag2 Lag1:Lag2
## Down 0.289444444 -0.03568254 -0.8014495
## Up -0.009213235 0.26036581 -0.1393632
# predict
qda.class2=predict(qda.fit2, Weekly_test)$class
# confusion matrix
table(qda.class2,Direction_test)
## Direction_test
## qda.class2 Down Up
## Down 23 36
## Up 20 25
# mean
mean(qda.class2==Direction_test)
## [1] 0.4615385
library(class)
train.X2 <- cbind(Lag2[train])
test.X2<- cbind(Lag2[!train])
train.Direction2 <- Direction[train]
# seed must be set in order to ensure reproducibility of results
set.seed(1)
knn.pred2 <- knn(train.X2, test.X2, train.Direction, k=10)
table(knn.pred2, Direction_test)
## Direction_test
## knn.pred2 Down Up
## Down 17 21
## Up 26 40
mean(knn.pred2==Direction_test)
## [1] 0.5480769
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
library(ISLR)
attach(Auto)
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median
mpg01 <- rep(0, length(Auto$mpg))
mpg01[Auto$mpg > median(Auto$mpg)] <- 1
Auto <- data.frame(Auto, mpg01)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
## mpg01
## Min. :0.0
## 1st Qu.:0.0
## Median :0.5
## Mean :0.5
## 3rd Qu.:1.0
## Max. :1.0
##
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01?
library(corrplot)
cor(Auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
M<-cor(Auto[,-9])
corrplot.mixed(M,lower.col='grey',order = "hclust")
(c) Split the data into a training set and a test set.
set.seed(1)
train <- sample(1:dim(Auto)[1], dim(Auto)[1]*.7, rep=FALSE)
test <- -train
training_data<- Auto[train, ]
testing_data= Auto[test, ]
mpg01.test <- mpg01[test]
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01
lda_model <- lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data)
lda_model
## Call:
## lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data)
##
## Prior probabilities of groups:
## 0 1
## 0.4927007 0.5072993
##
## Group means:
## cylinders weight displacement horsepower
## 0 6.777778 3611.052 271.9333 129.13333
## 1 4.187050 2342.165 116.8129 79.27338
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.3962357999
## weight -0.0008321338
## displacement -0.0047630097
## horsepower 0.0061919395
lda_pred = predict(lda_model, testing_data)
names(lda_pred)
## [1] "class" "posterior" "x"
# compute the confusion matrix
pred.lda <- predict(lda_model, testing_data)
table(pred.lda$class, mpg01.test)
## mpg01.test
## 0 1
## 0 50 3
## 1 11 54
mean(pred.lda$class != mpg01.test)
## [1] 0.1186441
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01
qda_model = qda(mpg01 ~ cylinders + horsepower + weight + acceleration, data=training_data)
qda_model
## Call:
## qda(mpg01 ~ cylinders + horsepower + weight + acceleration, data = training_data)
##
## Prior probabilities of groups:
## 0 1
## 0.4927007 0.5072993
##
## Group means:
## cylinders horsepower weight acceleration
## 0 6.777778 129.13333 3611.052 14.71556
## 1 4.187050 79.27338 2342.165 16.57338
# compute the confusion matrix
qda.class=predict(qda_model, testing_data)$class
table(qda.class, testing_data$mpg01)
##
## qda.class 0 1
## 0 52 5
## 1 9 52
mean(qda.class != testing_data$mpg01)
## [1] 0.1186441
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01
glm_model <- glm(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data, family = binomial)
summary(glm_model)
##
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower,
## family = binomial, data = training_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4794 -0.1963 0.1056 0.3508 3.3756
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.725290 2.147421 5.460 4.76e-08 ***
## cylinders 0.056770 0.419131 0.135 0.8923
## weight -0.001931 0.000817 -2.364 0.0181 *
## displacement -0.014718 0.009904 -1.486 0.1373
## horsepower -0.041518 0.017821 -2.330 0.0198 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 379.79 on 273 degrees of freedom
## Residual deviance: 144.49 on 269 degrees of freedom
## AIC: 154.49
##
## Number of Fisher Scoring iterations: 7
# predict
probs <- predict(glm_model, testing_data, type = "response")
pred.glm <- rep(0, length(probs))
pred.glm[probs > 0.5] <- 1
table(pred.glm, mpg01.test)
## mpg01.test
## pred.glm 0 1
## 0 53 3
## 1 8 54
# accuracy
mean(pred.glm != mpg01.test)
## [1] 0.09322034
Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors.
library(MASS)
attach(Boston)
Explore the data
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
pairs(Boston)
Create response variable
crim01 <- rep(0, length(Boston$crim))
crim01[Boston$crim > median(Boston$crim)] <- 1
Boston <- data.frame(Boston, crim01)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv crim01
## Min. : 1.73 Min. : 5.00 Min. :0.0
## 1st Qu.: 6.95 1st Qu.:17.02 1st Qu.:0.0
## Median :11.36 Median :21.20 Median :0.5
## Mean :12.65 Mean :22.53 Mean :0.5
## 3rd Qu.:16.95 3rd Qu.:25.00 3rd Qu.:1.0
## Max. :37.97 Max. :50.00 Max. :1.0
Create training and testing data
set.seed(1234)
train <- sample(1:dim(Boston)[1], dim(Boston)[1]*.7, rep=FALSE)
test <- -train
Boston.train <- Boston[train, ]
Boston.test <- Boston[test, ]
crim01.test <- crim01[test]
Fit a logistic model
fit.glm1 <- glm(crim01 ~ . - crim01 - crim, data = Boston, family = binomial, subset = train)
fit.glm1
##
## Call: glm(formula = crim01 ~ . - crim01 - crim, family = binomial,
## data = Boston, subset = train)
##
## Coefficients:
## (Intercept) zn indus chas nox rm
## -38.403844 -0.088513 -0.061222 1.575716 54.122064 -0.599424
## age dis rad tax ptratio black
## 0.016070 0.915126 0.680109 -0.005248 0.313389 -0.010162
## lstat medv
## 0.073570 0.204201
##
## Degrees of Freedom: 353 Total (i.e. Null); 340 Residual
## Null Deviance: 490.7
## Residual Deviance: 139.8 AIC: 167.8
# predict
probs <- predict(fit.glm1, Boston.test, type = "response")
pred.glm <- rep(0, length(probs))
pred.glm[probs > 0.5] <- 1
table(pred.glm, crim01.test)
## crim01.test
## pred.glm 0 1
## 0 67 12
## 1 8 65
# accuracy
mean(pred.glm != crim01.test)
## [1] 0.1315789