Problem 10

Use Weekly data set in ISLR package.

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.2
data(Weekly)

Part A

Numerical and Graphical summaries to show any patterns.

Names function to see variables:

names(Weekly)
## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"

Pairs function to visually see variable relationships:

pairs(Weekly)

Correlation Matrix to numerically see variable relationships:

cor(Weekly[,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000
corrplot(cor(Weekly[,-9]), method = 'circle')

Interpretation

The pairs function and correlation matrix show that the only variables that have a relationship are “Volume” and “Year”. All others are low and close to 0 meaning there is no relationship.

Plot of Volume and Year:

plot(Weekly$Volume)

Interpretation

The plot function shows that Volume (average number of shares traded) increases over time.

Part B

Logistic Regression with “Direction” as response and all other as predictors. Identify which predictors appear to be significant.

Logistic Regression

glm_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm_fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Interpretation

The variable “Lag2” has a low p-value (0.0296), and therefore has a statistically significant relationship with “Direction”. However, all the other variables have a higher p-value and therefore no significant relationship to “Direction”.

Part C

Confusion Matrix and overall Fraction of Correct Predictions.

Confusion Matrix

glm_prob <- predict(glm_fit, type = 'response')
glm_pred <- rep("Down", 1089)
glm_pred[glm_prob > 0.5] = "Up"
table(glm_pred, Weekly$Direction)
##         
## glm_pred Down  Up
##     Down   54  48
##     Up    430 557
mean(glm_pred == Weekly$Direction)
## [1] 0.5610652

Interpretation

Based on the confusion matrix, the accuracy rate of the model is 56.12%.

Part D

Fit the regression model for 1990 to 2008 with “Lag2” as the predictor. Compute Confusion Matrix and overall fraction of correct predictions for the data not in the model (2009 and 2010).

train <- (Weekly$Year < 2009)
Weekly_2009 <- Weekly[!train,]
dim(Weekly)
## [1] 1089    9
dim(Weekly_2009)
## [1] 104   9
Direction_2009 <- Weekly$Direction[!train]
glm_fit_2 <- glm(Direction ~ Lag2, data = Weekly, subset = train, family = binomial)
glm_prob_2 <- predict(glm_fit_2, Weekly_2009, type = 'response')
glm_pred_2 <- rep("Down", 104)
glm_pred_2[glm_prob_2 > 0.5] = "Up"
table(glm_pred_2, Direction_2009)
##           Direction_2009
## glm_pred_2 Down Up
##       Down    9  5
##       Up     34 56
mean(glm_pred_2 == Direction_2009)
## [1] 0.625

Interpretation

Based on the confusion matrix, the accuracy rate of the model is 62.5%.

Part E

Repeat Part D using LDA model.

library(MASS)
## Warning: package 'MASS' was built under R version 4.1.2
lda_fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
lda_fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162
plot(lda_fit)

lda_pred <- predict(lda_fit, Weekly_2009)
names(lda_pred)
## [1] "class"     "posterior" "x"
lda_class <- lda_pred$class
table(lda_class, Direction_2009)
##          Direction_2009
## lda_class Down Up
##      Down    9  5
##      Up     34 56
mean(lda_class == Direction_2009)
## [1] 0.625

Interpretation

Based on the confusion matrix, the accuracy rate of this model is also 62.5%. This model also shows that the market went down 44.77% of the days and up 55.23% of the days. Group means show that the previous Lag2 Day trends negative when the market decreases and positive when it increases.

Part F

Repeat Part D using QDA model.

qda_fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)
qda_fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
qda_pred <- predict(qda_fit, Weekly_2009)
names(qda_pred)
## [1] "class"     "posterior"
qda_class <- qda_pred$class
table(qda_class, Direction_2009)
##          Direction_2009
## qda_class Down Up
##      Down    0  0
##      Up     43 61
mean(qda_class == Direction_2009)
## [1] 0.5865385

Interpretation

Based on the confusion matrix, the accuracy rate of this model is 58.65%. This model also also shows that the market went down 44.77% of the days and up 55.23% of the days. Group means show that the previous Lag2 Day trends negative when the market decreases and positive when it increases.

Part G

Repeat Part D using KNN with K = 1.

library(class)
train_X <- as.matrix(Weekly$Lag2[train])
test_X <- as.matrix(Weekly$Lag2[!train])
train_Direction <- Weekly$Direction[train]
length(train_X)
## [1] 985
length(test_X)
## [1] 104
length(train_Direction)
## [1] 985
set.seed(1)
knn_pred <- knn(train_X, test_X, train_Direction, k = 1)
table(knn_pred, Direction_2009)
##         Direction_2009
## knn_pred Down Up
##     Down   21 30
##     Up     22 31
mean(knn_pred == Direction_2009)
## [1] 0.5

Interpretation

Based on the confusion matrix, the accuracy rate of this model is 50%.

Part H

Which method appears to produce the best results from the data?

Logistic Regression and Linear Discriminant Analysis (LDA) have the highest accuracy rates for the data models. Both are 62.5% accurate.

Part I

Experiment with different combinations

Logistic Regression with “Lag2” and “Lag2”:“Lag5”

glm_fit_3 <- glm(Direction ~ Lag2:Lag5 + Lag2, data = Weekly, family = binomial, subset = train)
glm_prob_3 <- predict(glm_fit_3, Weekly_2009, type = 'response')
glm_pred_3 <- rep("Down", length(glm_prob_3))
glm_pred_3[glm_prob_3 > 0.5] = "Up"
Direction_2009 = Weekly$Direction[!train]
table(glm_pred_3, Direction_2009)
##           Direction_2009
## glm_pred_3 Down Up
##       Down    9  4
##       Up     34 57
mean(glm_pred_3 == Direction_2009)
## [1] 0.6346154

KNN = 200

train_X_2 <- as.matrix(Weekly$Lag2[train])
test_X_2 <- as.matrix(Weekly$Lag2[!train])
train_Direction_2 <- Weekly$Direction[train]

set.seed(1)
knn_pred_2 <- knn(train_X_2, test_X_2, train_Direction_2, k = 200)
table(knn_pred_2, Direction_2009)
##           Direction_2009
## knn_pred_2 Down Up
##       Down    2  0
##       Up     41 61
mean(knn_pred_2 == Direction_2009)
## [1] 0.6057692

LDA with “Lag2” and “Lag2”:“Lag5”

lda_fit_2 <- lda(Direction ~ Lag2:Lag5 + Lag2, data = Weekly, family = binomial, subset = train)
lda_pred_2 <- predict(lda_fit_2, Weekly_2009)
table(lda_pred_2$class, Direction_2009)
##       Direction_2009
##        Down Up
##   Down    9  4
##   Up     34 57
mean(lda_pred_2$class == Direction_2009)
## [1] 0.6346154

QDA with “Lag2” and “Lag2”:“Lag5”

qda_fit_2 <- qda(Direction ~ Lag2:Lag5 + Lag2, data = Weekly, family = binomial, subset = train)
qda_pred_2 <- predict(qda_fit_2, Weekly_2009)
table(qda_pred_2$class, Direction_2009)
##       Direction_2009
##        Down Up
##   Down    5 13
##   Up     38 48
mean(qda_pred_2$class == Direction_2009)
## [1] 0.5096154

Interpretation

Based on above experiments, Logistic Regression and LDA with “Lag2” and “Lag2”:“Lag5” variables seem to be the most accurate models both with 63.46% accuracy rates.

Problem 11

Use Auto data set to develop model for predicting whether a car gets high or low gas mileage.

library(ISLR)
attach(Auto)
summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

Part A

Create binary variable for “mpg”.

mpg01 <- rep(0, length(mpg))
mpg01[mpg > median(mpg)] = 1
Auto <- data.frame(Auto, mpg01)
summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365  
##      mpg01    
##  Min.   :0.0  
##  1st Qu.:0.0  
##  Median :0.5  
##  Mean   :0.5  
##  3rd Qu.:1.0  
##  Max.   :1.0  
## 

Part B

Explore the data graphically.

pairs(Auto)

cor(Auto[,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000
corrplot(cor(Auto[,-9]), method = 'circle')

Interpretation

First, the pairs function shows any relationships between variables. The correlation matrix shows it in more detail. “mpg01” has a positive relationship with: “acceleration”, “origin”, and “year” (high values) “mpg01” has a negative relationship with: “horsepower”, “displacement”, “cylinders”, and “weight” (negative values)

Part C

Split the data into training and testing sets.

auto_train <- (year %% 2 == 0)
training <- Auto[train,]
testing <- Auto[-train,]

Part D

LDA on Training model to predict “mpg01” using variables from Part B. What is the test error?

a_lda_fit <- lda(mpg01 ~ displacement + horsepower + weight + year + cylinders + origin, data = training)
a_lda_pred <- predict(a_lda_fit, testing)
table(a_lda_pred$class, testing$mpg01)
##    
##       0   1
##   0 166   7
##   1  29 189
mean(a_lda_pred$class != testing$mpg01)
## [1] 0.09207161

Interpretation

LDA model has 92.07% error rate.

Part E

QDA on training model to predict “mpg01” using variables from Part B. What is the test error?

a_qda_fit <- qda(mpg01 ~ displacement + horsepower + weight + year + cylinders + origin, data = training)
a_qda_pred <- predict(a_qda_fit, testing)
table(a_qda_pred$class, testing$mpg01)
##    
##       0   1
##   0 175  14
##   1  20 182
mean(a_qda_pred$class != testing$mpg01)
## [1] 0.08695652

Interpretation

QDA model has 8.69% error rate.

Part F

Logistic Regression on training model to predict “mpg01” using variables from Part B. What is the test error?

a_glm_fit <- glm(mpg01 ~ displacement + horsepower + weight + year + cylinders + origin, data = training, family = binomial)
a_glm_prob <- predict(a_glm_fit, testing, type = 'response')
a_glm_pred <- rep(0, length(a_glm_prob))
a_glm_pred[a_glm_prob > 0.5] = 1
table(a_glm_pred, testing$mpg01)
##           
## a_glm_pred   0   1
##          0 172  13
##          1  23 183
mean(a_glm_pred != testing$mpg01)
## [1] 0.09207161

Interpretation

Logistic Regression model has 9.21% error rate.

Part G

KNN to predict “mpg01” with variables from Part B. What are the test errors? Which K Values perform best?

KNN_auto_train <- cbind(displacement,horsepower,weight,cylinders,year,origin)[auto_train,] KNN_auto_test <- cbind(displacement,horsepower,weight,cylinders,year,origin)[-auto_train,] set.seed(1) KNN_auto_pred <- knn(KNN_auto_train, KNN_auto_test, training\(mpg01, k=1) mean(KNN_auto_pred != testing\)mpg01)

detach(Auto)

Problem 13

Use Boston data set to fit classification models to predict whether a suburb has higher or lower crime rate than the median.

Data Set

attach(Boston)
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Crime Rate Variable

crime01 <- rep(0, length(crim))
crime01[crim > median(crim)] <- 1
Boston <- data.frame(Boston, crime01)

Training and Testing Data Sets

train_bos <- 1:(dim(Boston)[1]/2)
test_bos <- (dim(Boston)[1]/2 + 1):dim(Boston)[1]
training_b <- Boston[train_bos, ]
testing_b <- Boston[test_bos, ]
c01testing <- crime01[test_bos]

Visual Representation

pairs(Boston)

cor(Boston)
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
## crime01  0.40939545 -0.43615103  0.60326017  0.070096774  0.72323480
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
## crime01 -0.15637178  0.61393992 -0.61634164  0.619786249  0.60874128  0.2535684
##               black      lstat       medv     crime01
## crim    -0.38506394  0.4556215 -0.3883046  0.40939545
## zn       0.17552032 -0.4129946  0.3604453 -0.43615103
## indus   -0.35697654  0.6037997 -0.4837252  0.60326017
## chas     0.04878848 -0.0539293  0.1752602  0.07009677
## nox     -0.38005064  0.5908789 -0.4273208  0.72323480
## rm       0.12806864 -0.6138083  0.6953599 -0.15637178
## age     -0.27353398  0.6023385 -0.3769546  0.61393992
## dis      0.29151167 -0.4969958  0.2499287 -0.61634164
## rad     -0.44441282  0.4886763 -0.3816262  0.61978625
## tax     -0.44180801  0.5439934 -0.4685359  0.60874128
## ptratio -0.17738330  0.3740443 -0.5077867  0.25356836
## black    1.00000000 -0.3660869  0.3334608 -0.35121093
## lstat   -0.36608690  1.0000000 -0.7376627  0.45326273
## medv     0.33346082 -0.7376627  1.0000000 -0.26301673
## crime01 -0.35121093  0.4532627 -0.2630167  1.00000000
corrplot(cor(Boston), method = 'circle')

Interpretation

The pairs and correlation plots show “indus”, “nox”, “age”, “dis”, “rad”, and “tax” have the strongest relationship.

Logistic Regression

set.seed(1)
GLM_B_fit <- glm(crime01 ~ indus + nox + age + dis + rad + tax, data = training_b, family = binomial)
GLM_B_prob <- predict(GLM_B_fit, testing_b, type = "response")
GLM_B_pred <- rep(0, length(GLM_B_prob))
GLM_B_pred[GLM_B_prob > 0.5] = 1
table(GLM_B_pred, c01testing)
##           c01testing
## GLM_B_pred   0   1
##          0  75   8
##          1  15 155
mean(GLM_B_pred != c01testing)
## [1] 0.09090909

Interpretation The confusion matrix predicts that crime will increase 8 values above the median and down 15 below the median. Test error = 9.09%

LDA

lda.boston.fit <- lda(crime01 ~ indus + nox + age + dis + rad + tax, data = training_b, family = binomial)
lda.boston.pred <- predict(lda.boston.fit, testing_b)
table(lda.boston.pred$class, c01testing)
##    c01testing
##       0   1
##   0  81  18
##   1   9 145
mean(lda.boston.pred$class != c01testing)
## [1] 0.1067194

Interpretation

This confusion matrix predicts crime would increase 18 values above the median and decrease 9 values below the median. Test error = 10.67%

KNN K = 1

knn.boston.train <- cbind(indus, nox, age, dis, rad, tax)[train_bos,]
knn.boston.test <- cbind(indus, nox, age, dis, rad, tax)[test_bos,]
set.seed(1)
knn.boston.pred <- knn(knn.boston.train, knn.boston.test, c01testing, k=1)
table(knn.boston.pred, c01testing)
##                c01testing
## knn.boston.pred   0   1
##               0  31 155
##               1  59   8
mean(knn.boston.pred != c01testing)
## [1] 0.8458498

K = 50

knn.boston.pred <- knn(knn.boston.train, knn.boston.test, c01testing, k=50)
mean(knn.boston.pred != c01testing)
## [1] 0.2648221

K = 100

knn.boston.pred <- knn(knn.boston.train, knn.boston.test, c01testing, k=100)
mean(knn.boston.pred != c01testing)
## [1] 0.3043478

Interpretation

This confusion matrix predicts crime rate will increase up to 155 values above the median and decrease 59 below the median. K = 1 Test error: 84.58% | K = 50 Test error: 26.48% | K = 100 Test error: 30.43% | K = 50 has lowest test error After execution of all models, Logistic Regression is the most accurate with only a 9.09% error rate.