DATA MINING HW3

Chapter 04 (page 168): 10, 11, 13

Problem 10

This question should be answered using the Weekly data set, which is part of the ISLR package.

library(ISLR)
attach(Weekly)

dim(Weekly)

## [1] 1089    9

str(Weekly)

## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...

summary(Weekly)

##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
##

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

pairs(Weekly)

library(corrplot)

## corrplot 0.84 loaded

W <- cor(Weekly[,-9])
corrplot.mixed(W,lower.col='grey',order = "hclust")

From the graphs given above, the correlations between the lag variables and today’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between Year and Volume.

index<-1:nrow(Weekly)
loessMod50 <- loess(Weekly$Volume ~ index, span=0.50)
smoothed50 <- predict(loessMod50)
plot(y=Weekly$Volume,x=index,pch=20,cex=0.1,col='#bdbdbd', main='Volume against Time',ylab="Volume",xlab="Index")
lines(smoothed50, x=index, col="#de2d26",lwd=2)

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

glm.fits <- glm(Direction ~ Lag1+ Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family=binomial)
summary(glm.fits)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Based of the results, only Lag2 was statistically significant.

The predict() function can be used to predict the probability that the market will go up, given values of the predictors.

glm.probs <- predict(glm.fits, type="response")
glm.probs[1:10]

##         1         2         3         4         5         6         7         8 
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 0.5786097 0.5151972 
##         9        10 
## 0.5715200 0.5554287

We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.

contrasts(Direction)

##      Up
## Down  0
## Up    1

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down.

glm.pred=rep("Down" ,1089)
glm.pred[glm.probs >.5]="Up"

(c) Compute the confusion matrix and overall fraction of correct predictions.

table(glm.pred, Direction)

##         Direction
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557

Accuracy: (TN+TP)/(TN+FP+FN+TP)=(54 + 557) /1089 = 0.5611 or 56.11%
Error rate: (FP+FN)/(TP+FP+FN+TN) or 1 - accuracy
1 - 0.5611 = 0.4389 or 43.89%
Precision: TP / (FP + TP) = 557 /(48+557) = 92.07%
Sensitivity: TP / (TP+FN) = 557/(557+430) = 56.43%
Specificity: TN / (TN+FP) = 54 /(54+48) = 52.94%
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor.

# training
train <- (Weekly$Year < 2009)
Weekly_train <- Weekly[train,]
Weekly_test <- Weekly[!train,]
Direction_train <- Weekly_train$Direction
Direction_test <- Weekly_test$Direction

# running the model using the training data
logistic_wkly <- glm(Direction ~ Lag2, data = Weekly, family = binomial)

# output 
summary(logistic_wkly)

## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.564  -1.267   1.008   1.086   1.386  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.21473    0.06123   3.507 0.000453 ***
## Lag2         0.06279    0.02636   2.382 0.017230 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1490.4  on 1087  degrees of freedom
## AIC: 1494.4
## 
## Number of Fisher Scoring iterations: 4

# running the model using the testing data 
logistic_probs <- predict(logistic_wkly, Weekly_test, type = "response")
logistic_pred = rep("Down", length(Direction_test))
logistic_pred[logistic_probs > 0.5] <- "Up"
table(logistic_pred, Direction_test)

##              Direction_test
## logistic_pred Down Up
##          Down    9  5
##          Up     34 56

mean(logistic_pred == Direction_test)

## [1] 0.625

Accuracy: (TN+TP)/(TN+FP+FN+TP)= (9 + 56)/(9+5+34+56)= 65/104= 0.625= 62.5%

(e) Repeat (d) using LDA.

library(MASS)
lda.fit <- lda(Direction ~ Lag2, data= Weekly, subset=train)
lda.fit

## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162

plot(lda.fit,col='#bcbddc')

# The predict() function returns a list with three elements. 
lda.pred <-predict(lda.fit, Weekly_test)
names(lda.pred)

## [1] "class"     "posterior" "x"

# confusion matrix
lda.class <-lda.pred$class
table(lda.class, Direction_test)

##          Direction_test
## lda.class Down Up
##      Down    9  5
##      Up     34 56

# mean 
mean(lda.class==Direction_test)

## [1] 0.625

The LDA output indicates that pi 1 = 0.448 and pi 2 = 0.552; in other words, 44.8% of the training observations correspond to days during which the market went down.

(f) Repeat (d) using QDA.

qda.fit <- qda(Direction ~ Lag2, data=Weekly, subset = train)
qda.fit

## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581

# predict
qda.class=predict(qda.fit, Weekly_test)$class
# confusion matrix
table(qda.class,Direction_test)

##          Direction_test
## qda.class Down Up
##      Down    0  0
##      Up     43 61

# mean 
mean(qda.class==Direction_test)

## [1] 0.5865385

Accuracy: (TN+TP)/(TN+FP+FN+TP)= 58.65%

(g) Repeat (d) using KNN with K = 1.

library(class)
train.X <- cbind(Lag2[train])
test.X <- cbind(Lag2[!train])
train.Direction=Direction[train]
# seed must be set in order to ensure reproducibility of results
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Direction, k=1)
table(knn.pred, Direction_test)

##         Direction_test
## knn.pred Down Up
##     Down   21 30
##     Up     22 31

mean(knn.pred==Direction_test)

## [1] 0.5

Accuracy: (TN+TP)/(TN+FP+FN+TP)= 50%

(h) Which of these methods appears to provide the best results on this data?
Logistic
LDA
QDA
KNN

(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods.

glm.fits2 <- glm(Direction ~ Lag1*Lag2, data = Weekly, family=binomial)
summary(glm.fits)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

# running the model using the testing data 
logistic_probs2 <- predict(logistic_wkly, Weekly_test, type = "response")
logistic_pred2 <- rep("Down", length(Direction_test))
logistic_pred2[logistic_probs > 0.5] <- "Up"
table(logistic_pred2, Direction_test)

##               Direction_test
## logistic_pred2 Down Up
##           Down    9  5
##           Up     34 56

mean(logistic_pred2 == Direction_test)

## [1] 0.625

lda.fit2 <- lda(Direction ~ Lag1*Lag2, data= Weekly, subset=train)
lda.fit2

## Call:
## lda(Direction ~ Lag1 * Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##              Lag1        Lag2  Lag1:Lag2
## Down  0.289444444 -0.03568254 -0.8014495
## Up   -0.009213235  0.26036581 -0.1393632
## 
## Coefficients of linear discriminants:
##                    LD1
## Lag1      -0.285484602
## Lag2       0.295080109
## Lag1:Lag2  0.009629381

plot(lda.fit2,col='#bcbddc')

# The predict() function returns a list with three elements. 
lda.pred2 <-predict(lda.fit, Weekly_test)
names(lda.pred2)

## [1] "class"     "posterior" "x"

# confusion matrix
lda.class2 <-lda.pred2$class
table(lda.class2, Direction_test)

##           Direction_test
## lda.class2 Down Up
##       Down    9  5
##       Up     34 56

# mean 
mean(lda.class2==Direction_test)

## [1] 0.625

qda.fit2 <- qda(Direction ~ Lag1*Lag2, data=Weekly, subset = train)
qda.fit2

## Call:
## qda(Direction ~ Lag1 * Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##              Lag1        Lag2  Lag1:Lag2
## Down  0.289444444 -0.03568254 -0.8014495
## Up   -0.009213235  0.26036581 -0.1393632

# predict
qda.class2=predict(qda.fit2, Weekly_test)$class
# confusion matrix
table(qda.class2,Direction_test)

##           Direction_test
## qda.class2 Down Up
##       Down   23 36
##       Up     20 25

# mean 
mean(qda.class2==Direction_test)

## [1] 0.4615385

library(class)
train.X2 <- cbind(Lag2[train])
test.X2<- cbind(Lag2[!train])
train.Direction2 <- Direction[train]
# seed must be set in order to ensure reproducibility of results
set.seed(1)
knn.pred2 <- knn(train.X2, test.X2, train.Direction, k=10)
table(knn.pred2, Direction_test)

##          Direction_test
## knn.pred2 Down Up
##      Down   17 21
##      Up     26 40

mean(knn.pred2==Direction_test)

## [1] 0.5480769

Problem 11

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

library(ISLR)
attach(Auto)
str(Auto)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median

mpg01 <- rep(0, length(Auto$mpg))
mpg01[Auto$mpg > median(Auto$mpg)] <- 1
Auto <- data.frame(Auto, mpg01)
summary(Auto)

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365  
##      mpg01    
##  Min.   :0.0  
##  1st Qu.:0.0  
##  Median :0.5  
##  Mean   :0.5  
##  3rd Qu.:1.0  
##  Max.   :1.0  
##

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01?

library(corrplot)
cor(Auto[, -9])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000

M<-cor(Auto[,-9])
corrplot.mixed(M,lower.col='grey',order = "hclust")

(c) Split the data into a training set and a test set.

set.seed(1)
train <- sample(1:dim(Auto)[1], dim(Auto)[1]*.7, rep=FALSE)
test <- -train
training_data<- Auto[train, ]
testing_data= Auto[test, ]
mpg01.test <- mpg01[test]

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01

lda_model <- lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data)
lda_model

## Call:
## lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4927007 0.5072993 
## 
## Group means:
##   cylinders   weight displacement horsepower
## 0  6.777778 3611.052     271.9333  129.13333
## 1  4.187050 2342.165     116.8129   79.27338
## 
## Coefficients of linear discriminants:
##                        LD1
## cylinders    -0.3962357999
## weight       -0.0008321338
## displacement -0.0047630097
## horsepower    0.0061919395

lda_pred = predict(lda_model, testing_data)
names(lda_pred)

## [1] "class"     "posterior" "x"

# compute the confusion matrix
pred.lda <- predict(lda_model, testing_data)
table(pred.lda$class, mpg01.test)

##    mpg01.test
##      0  1
##   0 50  3
##   1 11 54

mean(pred.lda$class != mpg01.test)

## [1] 0.1186441

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01

qda_model = qda(mpg01 ~ cylinders + horsepower + weight + acceleration, data=training_data)
qda_model

## Call:
## qda(mpg01 ~ cylinders + horsepower + weight + acceleration, data = training_data)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4927007 0.5072993 
## 
## Group means:
##   cylinders horsepower   weight acceleration
## 0  6.777778  129.13333 3611.052     14.71556
## 1  4.187050   79.27338 2342.165     16.57338

# compute the confusion matrix
qda.class=predict(qda_model, testing_data)$class
table(qda.class, testing_data$mpg01)

##          
## qda.class  0  1
##         0 52  5
##         1  9 52

mean(qda.class != testing_data$mpg01)

## [1] 0.1186441

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01

glm_model <- glm(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data, family = binomial)
summary(glm_model)

## 
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower, 
##     family = binomial, data = training_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4794  -0.1963   0.1056   0.3508   3.3756  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  11.725290   2.147421   5.460 4.76e-08 ***
## cylinders     0.056770   0.419131   0.135   0.8923    
## weight       -0.001931   0.000817  -2.364   0.0181 *  
## displacement -0.014718   0.009904  -1.486   0.1373    
## horsepower   -0.041518   0.017821  -2.330   0.0198 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 379.79  on 273  degrees of freedom
## Residual deviance: 144.49  on 269  degrees of freedom
## AIC: 154.49
## 
## Number of Fisher Scoring iterations: 7

# predict
probs <- predict(glm_model, testing_data, type = "response")
pred.glm <- rep(0, length(probs))
pred.glm[probs > 0.5] <- 1
table(pred.glm, mpg01.test)

##         mpg01.test
## pred.glm  0  1
##        0 53  3
##        1  8 54

# accuracy
mean(pred.glm != mpg01.test)

## [1] 0.09322034

Problem 13

Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors.

library(MASS)
attach(Boston)

Explore the data

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

pairs(Boston)

Create response variable

crim01 <- rep(0, length(Boston$crim))
crim01[Boston$crim > median(Boston$crim)] <- 1
Boston <- data.frame(Boston, crim01)
summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv           crim01   
##  Min.   : 1.73   Min.   : 5.00   Min.   :0.0  
##  1st Qu.: 6.95   1st Qu.:17.02   1st Qu.:0.0  
##  Median :11.36   Median :21.20   Median :0.5  
##  Mean   :12.65   Mean   :22.53   Mean   :0.5  
##  3rd Qu.:16.95   3rd Qu.:25.00   3rd Qu.:1.0  
##  Max.   :37.97   Max.   :50.00   Max.   :1.0

Create training and testing data

set.seed(1234)
train <- sample(1:dim(Boston)[1], dim(Boston)[1]*.7, rep=FALSE)
test <- -train
Boston.train <- Boston[train, ]
Boston.test <- Boston[test, ]
crim01.test <- crim01[test]

Fit a logistic model

fit.glm1 <- glm(crim01 ~ . - crim01 - crim, data = Boston, family = binomial, subset = train)
fit.glm1

## 
## Call:  glm(formula = crim01 ~ . - crim01 - crim, family = binomial, 
##     data = Boston, subset = train)
## 
## Coefficients:
## (Intercept)           zn        indus         chas          nox           rm  
##  -38.403844    -0.088513    -0.061222     1.575716    54.122064    -0.599424  
##         age          dis          rad          tax      ptratio        black  
##    0.016070     0.915126     0.680109    -0.005248     0.313389    -0.010162  
##       lstat         medv  
##    0.073570     0.204201  
## 
## Degrees of Freedom: 353 Total (i.e. Null);  340 Residual
## Null Deviance:       490.7 
## Residual Deviance: 139.8     AIC: 167.8

# predict
probs <- predict(fit.glm1, Boston.test, type = "response")
pred.glm <- rep(0, length(probs))
pred.glm[probs > 0.5] <- 1
table(pred.glm, crim01.test)

##         crim01.test
## pred.glm  0  1
##        0 67 12
##        1  8 65

# accuracy
mean(pred.glm != crim01.test)

## [1] 0.1315789

DATA MINING HW3

E. Falcon

2/27/2020

Chapter 04 (page 168): 10, 11, 13

Problem 10

Problem 11

Problem 13