library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.3
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
library(class)
## Warning: package 'class' was built under R version 4.4.3
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.3
data("Weekly")
This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
The Weekly dataset contains weekly stock market data from 1990 to 2010. The lag variables (Lag1 to Lag5) have average values close to zero and show no strong trends. The “Volume” variable stands out: it has increased steadily over time, which is also reflected in its high correlation with “Year” (correlation = 0.84). The lag variables, however, have only weak correlations with “Today” (and hence carry little information about “Direction”), meaning they don’t strongly predict whether the market goes up or down. Overall, no obvious patterns appear except for the upward trend in trading volume.
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
cor(Weekly[, -9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
plot(Weekly$Volume, type = "l", main = "Volume Over Time", ylab = "Volume", xlab = "Week")
pairs(Weekly[, 2:6], main = "Lag Variables Pairs Plot", col = ifelse(Weekly$Direction == "Up", "blue", "red"))
A logistic regression model was built using Lag1 to Lag5 and Volume to predict the market Direction. Based on the output, only Lag2 was statistically significant, with a p-value of 0.0296. This suggests Lag2 has some predictive value. The intercept was also significant. All other variables—Lag1, Lag3, Lag4, Lag5, and Volume—had p-values much higher than 0.05, meaning they were not statistically significant and likely don’t help the model much.
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
The confusion matrix shows that the logistic regression model predicted “Up” much more often than “Down.” It correctly predicted 557 “Up” weeks and only 54 “Down” weeks, while misclassifying 430 “Down” weeks and 48 “Up” weeks. The overall accuracy was about 56.1%, barely better than the 55.6% (605/1089) obtained by always guessing “Up.” This suggests the model heavily favors predicting “Up,” even when it’s wrong, leading to many false positives for that class.
glm.probs <- predict(glm.fit, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")
table(Predicted = glm.pred, Actual = Weekly$Direction)
## Actual
## Predicted Down Up
## Down 54 48
## Up 430 557
mean(glm.pred == Weekly$Direction)
## [1] 0.5610652
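The imbalance described above can be quantified with class-specific rates; a minimal sketch (the helper object conf is illustrative, and both ratios follow directly from the counts already shown):
conf <- table(Predicted = glm.pred, Actual = Weekly$Direction)
conf["Up", "Up"] / sum(conf[, "Up"]) # true positive rate for "Up": 557/605
## [1] 0.9206612
conf["Down", "Down"] / sum(conf[, "Down"]) # true negative rate: 54/484
## [1] 0.1115702
The model catches almost every “Up” week but recognizes barely one in nine “Down” weeks.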
A simpler logistic regression model was created using only Lag2 as the predictor and trained on data from 1990 to 2008. When tested on the held-out 2009–2010 data, the model achieved an improved accuracy of 62.5%. Note that the full model’s 56.1% was measured on the same data used to fit it, so this out-of-sample figure is the more honest benchmark. The confusion matrix, while still favoring “Up” predictions, shows proportionally fewer mistakes, suggesting the smaller model is more reliable than the one with all variables.
train <- Weekly$Year <= 2008
test <- Weekly$Year > 2008
glm.fit2 <- glm(Direction ~ Lag2, data = Weekly, subset = train, family = binomial)
glm.probs2 <- predict(glm.fit2, Weekly[!train, ], type = "response")
glm.pred2 <- ifelse(glm.probs2 > 0.5, "Up", "Down")
table(Predicted = glm.pred2, Actual = Weekly$Direction[!train])
## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
mean(glm.pred2 == Weekly$Direction[!train])
## [1] 0.625
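For context, the baseline of always predicting “Up” on the test period can be computed directly (the 2009–2010 weeks contain 61 “Up” and 43 “Down,” as the later confusion matrices show, so this ratio is fully determined by counts in the document):
# Share of "Up" weeks in the 2009-2010 test period: 61/104
mean(Weekly$Direction[!train] == "Up")
## [1] 0.5865385
So the 62.5% accuracy is a modest improvement over this 58.7% baseline.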
Linear Discriminant Analysis (LDA) was applied using only Lag2 as the predictor, trained on data from 1990 to 2008. When tested on the 2009–2010 data, the model achieved an accuracy of 62.5%, with a confusion matrix identical to that of the logistic regression model. This suggests that LDA and logistic regression perform equally well when using Lag2 alone to predict the market direction.
lda.fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.pred <- predict(lda.fit, Weekly[!train, ])
table(Predicted = lda.pred$class, Actual = Weekly$Direction[!train])
## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
mean(lda.pred$class == Weekly$Direction[!train])
## [1] 0.625
QDA was used with Lag2 as the only predictor, trained on data from 1990 to 2008. On the 2009–2010 test data, it achieved an accuracy of 58.7%. The confusion matrix showed that the model predicted every week as “Up,” resulting in zero correct predictions for weeks when the market actually went down. This means the model was biased and performed worse than LDA and logistic regression.
qda.fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.pred <- predict(qda.fit, Weekly[test, ])
table(Predicted = qda.pred$class, Actual = Weekly$Direction[test])
## Actual
## Predicted Down Up
## Down 0 0
## Up 43 61
mean(qda.pred$class == Weekly$Direction[test])
## [1] 0.5865385
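One way to see why QDA collapses to a single class is to inspect the posterior probabilities it assigns; a sketch (output omitted): if the posterior for “Down” never exceeds 0.5 on the test weeks, every week is classified “Up.”
# Largest posterior probability QDA assigns to "Down" on the test set
max(qda.pred$posterior[, "Down"])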
KNN with K = 1 didn’t work well. It achieved 50% accuracy, no better than random guessing, suggesting that K = 1 is too flexible and overfits the training data.
train.X <- data.frame(Lag2 = Weekly$Lag2[train])
test.X <- data.frame(Lag2 = Weekly$Lag2[!train])
train.Direction <- Weekly$Direction[train]
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(Predicted = knn.pred, Actual = Weekly$Direction[!train])
## Actual
## Predicted Down Up
## Down 21 30
## Up 22 31
mean(knn.pred == Weekly$Direction[!train])
## [1] 0.5
Naive Bayes was used with Lag2 as the predictor and trained on data from 1990 to 2008. On the 2009–2010 test set, it predicted “Up” for every week, just like QDA. The confusion matrix showed that it got the 61 “Up” weeks right and missed all 43 “Down” weeks, for an overall accuracy of 58.7%. Although Naive Bayes is often competitive, in this case it made the same one-sided mistake as QDA.
nb.fit <- naiveBayes(Direction ~ Lag2, data = Weekly, subset = train)
nb.pred <- predict(nb.fit, Weekly[!train, ])
table(Predicted = nb.pred, Actual = Weekly$Direction[!train])
## Actual
## Predicted Down Up
## Down 0 0
## Up 43 61
mean(nb.pred == Weekly$Direction[!train])
## [1] 0.5865385
Among all the models tested, logistic regression and LDA with Lag2 tied for the best performance, each with 62.5% accuracy and identical confusion matrices. QDA and Naive Bayes performed worse, predicting only “Up” and missing every “Down” week. KNN with K = 1 had the lowest accuracy at just 50%. Overall, logistic regression (and equivalently LDA) was the most reliable and balanced method for this data.
To improve the model, interaction terms and additional predictors were tested. A logistic regression model using Lag1 * Lag2, Lag3, and Volume was fit (below, on the full data set, to examine the coefficients); Lag2 remained statistically significant, while the interaction term Lag1:Lag2 was not. On the 2009–2010 test data, this model achieved an accuracy of 63.5%, better than the model using Lag2 alone (a sketch of the test-set evaluation follows the summary output below). For KNN, increasing K to 3 or 5 also improved accuracy compared to K = 1, showing that a moderate K helps reduce overfitting. Overall, the best result came from logistic regression with selected variables and interactions, making it the most effective approach for predicting market direction in this dataset.
glm.fit3 <- glm(Direction ~ Lag1 * Lag2 + Lag3 + Volume, data = Weekly, family = binomial)
summary(glm.fit3)
##
## Call:
## glm(formula = Direction ~ Lag1 * Lag2 + Lag3 + Volume, family = binomial,
## data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.255045 0.085134 2.996 0.00274 **
## Lag1 -0.035440 0.028143 -1.259 0.20793
## Lag2 0.057488 0.026829 2.143 0.03213 *
## Lag3 -0.015122 0.026497 -0.571 0.56819
## Volume -0.019599 0.036768 -0.533 0.59400
## Lag1:Lag2 0.002407 0.006757 0.356 0.72171
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1487.6 on 1083 degrees of freedom
## AIC: 1499.6
##
## Number of Fisher Scoring iterations: 4
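Neither the 63.5% test accuracy nor the larger-K KNN results are shown in the output above; a minimal sketch of how they could be computed, refitting the interaction model on the 1990–2008 training years and reusing the Lag2 objects defined earlier (the names glm.fit3t, glm.pred3, and knn.k are illustrative; numbers not reproduced here):
# Refit the interaction model on the training period and score 2009-2010
glm.fit3t <- glm(Direction ~ Lag1 * Lag2 + Lag3 + Volume,
                 data = Weekly, subset = train, family = binomial)
glm.pred3 <- ifelse(predict(glm.fit3t, Weekly[!train, ], type = "response") > 0.5, "Up", "Down")
mean(glm.pred3 == Weekly$Direction[!train]) # test accuracy
# KNN with moderate K on the same Lag2 train/test split
set.seed(1)
for (k in c(1, 3, 5)) {
  knn.k <- knn(train.X, test.X, train.Direction, k = k)
  cat("k =", k, "accuracy:", round(mean(knn.k == Weekly$Direction[!train]), 3), "\n")
}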
A new column called mpg01 was created to indicate whether a car has high or low gas mileage: it equals 1 when a car’s miles per gallon is above the median and 0 otherwise.
data("Auto")
Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
table(Auto$mpg01)
##
## 0 1
## 196 196
The boxplots show that cars with high MPG usually have lower weight, less horsepower, smaller engine displacement, and fewer cylinders, making these the most useful features for prediction.
par(mfrow = c(2, 3))
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement")
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower")
boxplot(weight ~ mpg01, data = Auto, main = "Weight")
boxplot(acceleration ~ mpg01, data = Auto, main = "Acceleration")
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders")
boxplot(origin ~ mpg01, data = Auto, main = "Origin")
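The visual pattern can be cross-checked numerically; a quick sketch comparing group medians (output omitted):
# Median of each candidate predictor within the low- and high-mileage groups
aggregate(cbind(weight, horsepower, displacement, cylinders) ~ mpg01,
          data = Auto, FUN = median)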
The data was split into training (70%) and test (30%) sets so the models could learn from one part and be evaluated on the other, giving an honest measure of out-of-sample performance.
set.seed(1)
train_idx <- sample(1:nrow(Auto), nrow(Auto) * 0.7)
train <- Auto[train_idx, ]
test <- Auto[-train_idx, ]
Linear Discriminant Analysis (LDA) was used to predict mpg01 using the variables most related to fuel efficiency, such as weight, horsepower, and displacement. The model was trained on the training set and tested on the test set, where it achieved an accuracy of approximately 88.1%, with a test error rate of 11.9%. The confusion matrix showed that the model correctly classified most observations, making it a strong performer for this task.
lda.fit <- lda(mpg01 ~ cylinders + weight + horsepower + displacement, data = train)
lda.pred <- predict(lda.fit, test)
table(Predicted = lda.pred$class, Actual = test$mpg01)
## Actual
## Predicted 0 1
## 0 50 3
## 1 11 54
mean(lda.pred$class != test$mpg01)
## [1] 0.1186441
Quadratic Discriminant Analysis (QDA) was applied to predict mpg01 using key features like weight, horsepower, and displacement. On the test set, the model achieved an accuracy of 88.1%, resulting in a test error rate of 11.9%, the same as LDA. The confusion matrix showed that QDA correctly classified most cases, with slightly fewer false positives than LDA, and performed well overall.
qda.fit <- qda(mpg01 ~ cylinders + weight + horsepower + displacement, data = train)
qda.pred <- predict(qda.fit, test)
table(Predicted = qda.pred$class, Actual = test$mpg01)
## Actual
## Predicted 0 1
## 0 52 5
## 1 9 52
mean(qda.pred$class != test$mpg01)
## [1] 0.1186441
Logistic regression was used to predict mpg01 with the most relevant variables, including weight, horsepower, and displacement. The model performed very well on the test set, achieving an accuracy of 90.7% and a test error rate of 9.3%. The confusion matrix showed it correctly predicted both high- and low-MPG cars, making it the best-performing model so far.
glm.fit <- glm(mpg01 ~ cylinders + weight + horsepower + displacement,
data = train, family = binomial)
glm.probs <- predict(glm.fit, test, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
table(Predicted = glm.pred, Actual = test$mpg01)
## Actual
## Predicted 0 1
## 0 53 3
## 1 8 54
mean(glm.pred != test$mpg01)
## [1] 0.09322034
Naive Bayes was used to predict mpg01 using the most important features related to fuel efficiency. On the test set, the model achieved an accuracy of 88.98%, with a test error rate of approximately 11.0%. The confusion matrix showed it made a few more mistakes than logistic regression but still performed very well overall.
nb.fit <- naiveBayes(as.factor(mpg01) ~ cylinders + weight + horsepower + displacement, data = train)
nb.pred <- predict(nb.fit, test)
table(Predicted = nb.pred, Actual = test$mpg01)
## Actual
## Predicted 0 1
## 0 52 4
## 1 9 53
mean(nb.pred != test$mpg01)
## [1] 0.1101695
K-Nearest Neighbors (KNN) was used to predict mpg01 using key features, with several values of K tested. For K = 1 and K = 3, the test error rate was 14.4%, while K = 5 gave a slightly better result with a test error rate of 13.6%. This shows that using a slightly larger K helps improve accuracy by reducing overfitting.
train.X <- scale(train[, c("cylinders", "weight", "horsepower", "displacement")])
test.X <- scale(test[, c("cylinders", "weight", "horsepower", "displacement")])
train.Y <- train$mpg01
set.seed(1)
for (k in c(1, 3, 5)) {
knn.pred <- knn(train.X, test.X, train.Y, k = k)
err <- mean(knn.pred != test$mpg01)
cat("Test error for k =", k, ":", round(err, 3), "\n")
}
## Test error for k = 1 : 0.144
## Test error for k = 3 : 0.144
## Test error for k = 5 : 0.136
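One caveat about the chunk above: scale() standardizes the training and test sets separately, each with its own means and standard deviations. A common alternative (a sketch of the idea, not what was run above; vars, train.sc, and test.sc are illustrative names) reuses the training parameters for the test set:
vars <- c("cylinders", "weight", "horsepower", "displacement")
train.sc <- scale(train[, vars])
# Standardize the test set with the training means and SDs
test.sc <- scale(test[, vars],
                 center = attr(train.sc, "scaled:center"),
                 scale = attr(train.sc, "scaled:scale"))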
To find out whether a neighborhood in Boston has a high crime rate, a new variable called crim01 was created. This variable is 1 if the crime rate is above the median, and 0 if it is below. The analysis used a few features from the dataset that are related to neighborhood conditions:
- nox: air pollution
- rm: number of rooms in homes
- dis: distance to employment centers
- tax: property tax
- lstat: percentage of low-income residents
The data was then split into 70% for training and 30% for testing, and different models were tried to see which could best predict high or low crime.
data("Boston")
Boston$crim01 <- ifelse(Boston$crim > median(Boston$crim), 1, 0)
Boston <- Boston[, !names(Boston) %in% "crim"]
set.seed(1)
train_idx <- sample(1:nrow(Boston), nrow(Boston) * 0.7)
train <- Boston[train_idx, ]
test <- Boston[-train_idx, ]
A logistic regression model was trained using the predictors nox, rm, dis, tax, and lstat. On the test set, it correctly classified most neighborhoods, with 61 true negatives and 64 true positives. It made 27 errors in total (12 false positives and 15 false negatives), resulting in a test error rate of about 17.8%, or 82.2% accuracy. The fitted coefficients indicated that higher air pollution (nox) and property tax (tax) were linked to more crime, while more rooms and greater distance from employment centers were associated with less crime. The model performed well and was easy to interpret.
glm.fit <- glm(crim01 ~ nox + rm + dis + tax + lstat, data = train, family = binomial)
glm.probs <- predict(glm.fit, test, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
table(Predicted = glm.pred, Actual = test$crim01)
## Actual
## Predicted 0 1
## 0 61 15
## 1 12 64
mean(glm.pred != test$crim01)
## [1] 0.1776316
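The coefficient signs mentioned above are not printed in this output; they can be checked with the model summary (a sketch; output omitted):
# Positive coefficients push predictions toward high crime, negative ones toward low crime
summary(glm.fit)$coefficients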
The LDA model used the same predictors as logistic regression and performed well on the test data. It correctly predicted 68 low-crime and 60 high-crime neighborhoods, while making 24 mistakes (5 false positives and 19 false negatives). This gave a test error rate of about 15.8%, slightly better than logistic regression. LDA did a good job overall, especially at correctly identifying low-crime areas.
lda.fit <- lda(crim01 ~ nox + rm + dis + tax + lstat, data = train)
lda.pred <- predict(lda.fit, test)
table(Predicted = lda.pred$class, Actual = test$crim01)
## Actual
## Predicted 0 1
## 0 68 19
## 1 5 60
mean(lda.pred$class != test$crim01)
## [1] 0.1578947
The Naive Bayes model also used the same predictors and gave strong results. It correctly predicted 64 low-crime and 64 high-crime neighborhoods, with 24 total errors (9 false positives and 15 false negatives). The test error rate was about 15.8%, matching the performance of LDA. This model was simple, fast, and accurate, making it a reliable choice for this classification task.
nb.fit <- naiveBayes(as.factor(crim01) ~ nox + rm + dis + tax + lstat, data = train)
nb.pred <- predict(nb.fit, test)
table(Predicted = nb.pred, Actual = test$crim01)
## Actual
## Predicted 0 1
## 0 64 15
## 1 9 64
mean(nb.pred != test$crim01)
## [1] 0.1578947
The KNN model was tested using several values of K. The best performance came from K = 3, with a test error rate of only 6.6%, the lowest among all models. K = 1 also performed well with an error of 8.6%, while K = 5 and K = 10 had slightly higher errors of 9.2%. These results show that KNN is highly accurate when the right K value is chosen, especially with properly scaled data. However, KNN can be harder to interpret and is more sensitive to noise or irrelevant features.
train.X <- scale(train[, c("nox", "rm", "dis", "tax", "lstat")])
test.X <- scale(test[, c("nox", "rm", "dis", "tax", "lstat")])
train.Y <- train$crim01
for (k in c(1, 3, 5, 10)) {
knn.pred <- knn(train.X, test.X, train.Y, k = k)
err <- mean(knn.pred != test$crim01)
cat("Test error for k =", k, ":", round(err, 3), "\n")
}
## Test error for k = 1 : 0.086
## Test error for k = 3 : 0.066
## Test error for k = 5 : 0.092
## Test error for k = 10 : 0.092
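Note that knn() breaks distance ties at random and no seed is set immediately before this loop (the earlier set.seed(1) precedes the train/test split), so rerunning the loop in isolation may give slightly different errors. A reproducible variant (a sketch):
set.seed(1)
for (k in c(1, 3, 5, 10)) {
  knn.pred <- knn(train.X, test.X, train.Y, k = k)
  cat("Test error for k =", k, ":", round(mean(knn.pred != test$crim01), 3), "\n")
}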
In this analysis, several classification models were used to predict whether a Boston neighborhood has a high or low crime rate. All models used the same set of predictors related to housing and environment: nox, rm, dis, tax, and lstat.
KNN with K = 3 gave the best performance, with a test error rate of just 6.6%.
Naive Bayes and LDA both performed well with identical test error rates of 15.8%.
Logistic regression also did well, with a slightly higher error of 17.8%, but was easy to interpret and reliable.
While KNN was the most accurate, it is more complex to tune and interpret. Logistic regression, LDA, and Naive Bayes offer a strong balance between simplicity and performance. Overall, this shows that crime in Boston neighborhoods can be predicted with good accuracy using a few key features related to housing and socio-economic conditions.