Advanced Predictive Modeling HW 1


1. (3+2+3=8 pts) K-nearest Neighbor Classification

Suppose I use a validation set to determine that K=13 is a good choice when applying the K-Nearest Neighbor (KNN) Classifier to a given classification problem.

a. What are two main uses of a validation or holdout dataset? (The answer should not include determining K for a KNN classifier!)

The two main uses of a validation or holdout dataset are:

  1. Estimating prediction error: the validation set is used to approximate the expected prediction error that will be observed on the test set (and on future data). The observations in the training set already have known classes (e.g., Spam or Ham); held-out observations are then assigned a class by the fitted model, and the expected prediction error is approximated by averaging the errors over the validation set.


  2. Model selection and tuning: scoring candidate models (or candidate parameter settings) on held-out data lets us compare prediction methods and tune model parameters without touching the test set. Extensions such as K-fold cross-validation give every observation an opportunity to appear in the hold-out role; by rotating observations between the training and validation sets, we extract the maximum information from the data and get a more stable picture of how each method will perform on new data. Neither use would be possible without designating a validation or hold-out set beforehand (see the sketch after this list).
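A minimal sketch of both uses on synthetic data (the data, split sizes, and the two candidate models below are all hypothetical, chosen only to illustrate the mechanics):

set.seed(1)
n <- 200
x <- runif(n, 0, 5)
y <- 2 + 3*x - 0.4*x^2 + rnorm(n)             # hypothetical ground truth
train <- sample(n, 120)
val   <- setdiff(1:n, train)

fit1 <- lm(y ~ x, subset = train)             # candidate model 1
fit2 <- lm(y ~ poly(x, 2), subset = train)    # candidate model 2

# Use 1: validation error estimates the expected prediction error
val.mse <- function(fit) mean((y[val] - predict(fit, data.frame(x = x[val])))^2)

# Use 2: compare and tune candidate models on validation error alone
c(model1 = val.mse(fit1), model2 = val.mse(fit2))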


b. What is one principal drawback of the K-Nearest Neighbor (KNN) Classifier?

The principal drawback of the K-Nearest Neighbor Classifier is that it does not generalize well and is not robust to noisy data. The KNN algorithm weights all features equally when computing similarities and assigning classes, so when only a small subset of the features is actually useful for classification, the irrelevant features add noise to the distances and increase classification errors relative to more robust methods.

Additionally, as discussed in class, the KNN approach is highly susceptible to “The Curse of Dimensionality”: Euclidean distances become very noisy for high-dimensional problems, as the small simulation below illustrates.
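A hypothetical simulation (not part of the assignment): as the dimension d grows, the nearest and farthest of 100 random points become nearly equidistant, so Euclidean “nearness” carries little signal.

set.seed(1)
for (d in c(2, 10, 100, 1000)) {
  X <- matrix(runif(100 * d), ncol = d)   # 100 random points in d dimensions
  D <- as.matrix(dist(X))[1, -1]          # distances from point 1 to the rest
  cat(sprintf("d = %4d   relative contrast (max - min)/min = %.2f\n",
              d, (max(D) - min(D)) / min(D)))
}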


c. If you used K=3 instead of 13 as determined for this problem, then would you more likely be overfitting or underfitting the data? Briefly justify your choice.

K=3 means that a new case (data point) is assigned the majority class among its 3 nearest neighbors; likewise, with K=13 a new case is classified by a vote of its 13 nearest neighbors. Both intuition and experience tell us that K=3 will most likely overfit the data, producing a more complex decision boundary that chases patterns which will not generalize. While the training error rate might look good, the inability to generalize will most likely result in a much higher test error rate. K=13 is the better choice here, although in some settings a large K can underfit the data and introduce bias into the model.

In general, increasing K decreases variance and increases bias, while decreasing K does the opposite. This is the bias-variance tradeoff in action, and it is easy to verify empirically (see the sketch below).
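A hedged sketch, using the built-in iris data as a stand-in for the assignment's dataset (the split and the grid of K values are arbitrary choices): validation error across several values of K.

library(class)                          # provides knn()
set.seed(2)
tr <- sample(nrow(iris), 100)           # 100 training / 50 validation rows
ks <- c(1, 3, 5, 9, 13, 21)
val.err <- sapply(ks, function(k)
  mean(knn(iris[tr, 1:4], iris[-tr, 1:4], iris$Species[tr], k = k)
       != iris$Species[-tr]))
data.frame(k = ks, val.err)             # small K: low bias but high variance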

2. (5 pts) Maximum Likelihood Estimation

Suppose a manager at an internet sales company wants to estimate how fast his salesperson is generating successful leads. Instead of recording the time for each lead, the time taken to generate the next 5 leads is recorded, i.e., there is one recording (denoting the elapsed time) for every 5 consecutive leads. For a specific salesperson, the time intervals recorded are {1, 3, 1.5, 4, 2, 7, 1.2, 2, 4, 3.1} hours. A statistician suggests that if these time intervals are assumed to arise by i.i.d. sampling from the following distribution:


(where C is a normalizing constant), and if θ can be estimated, then he can provide detailed information about the lead generation process, including average rates, variances etc.

Find the Maximum Likelihood estimate for θ based on the recorded observations.


In general, the method of maximum likelihood selects the parameter values that maximize the “agreement” of the selected model with the observed data. For an i.i.d. sample, we form the likelihood as the product of the densities, take the log, differentiate with respect to θ, and set the derivative to zero. Solving the resulting equation with the data listed above gives the estimate of θ.

Note: The derivation below is abbreviated to illustrate the process of calculating the MLE, rather than taking up the entire page by repetitively plugging in every value of X.


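The density itself did not render above. Under one natural reading, recording the elapsed time of 5 consecutive leads from a memoryless process gives an Erlang/Gamma density of the form f(x | θ) = C·x^4·e^(−θx); this is an assumption, not the assignment's formula. Under that assumption the MLE has a closed form, and a numerical check is straightforward:

x <- c(1, 3, 1.5, 4, 2, 7, 1.2, 2, 4, 3.1)   # recorded intervals (hours)
n <- length(x)

# Assumed density: f(x|theta) = C * x^4 * exp(-theta * x), i.e. Gamma with
# shape 5 and rate theta, so C = theta^5 / 4!. Setting the derivative of the
# log-likelihood to zero:  5n/theta - sum(x) = 0  =>  theta.hat = 5n/sum(x)
theta.closed <- 5 * n / sum(x)

# Numerical check: minimize the negative log-likelihood directly
negll <- function(theta) -sum(dgamma(x, shape = 5, rate = theta, log = TRUE))
theta.numeric <- optimize(negll, c(0.01, 20))$minimum

c(closed = theta.closed, numeric = theta.numeric)   # both approx. 50/28.8 = 1.74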

3. (4+6=10 pts) Data Exploration and simple MLR with R

Consider the dataset provided (BostonDerived.RData) which is derived from the “Boston Housing” data set, part of the MASS package. This dataset records properties of 506 housing zones in the Greater Boston area. For a description of the original data, see http://archive.ics.uci.edu/ml/datasets/Housing. Typically one is interested in predicting MEDV (median home value) based on other attributes.

(a) Generate box-plots of the LSTAT (% of lower status in the population) and MEDV (median home value) attributes and identify the cutoff values for outliers. Generate a scatterplot of MEDV against LSTAT; comment on how inclusion of the outliers would affect a predictive model of median home value as a function of % of lower status in the population. (Hint: Such effects may be easier to visualize if the outliers are a different color or symbol than the other data.)

load(file="BostonDerived.RData")
attach(Boston)

boxplot(lstat, boxfill = "light gray", outpch = 21:25, outlty = 2,
        bg = "pink", lwd = 2, medcol = "dark green", medcex = 2, medpch = 20,
        ylim = c(0, 50), range = 1.5,
        main = "Dispersion in LSTAT Variable w/ Outlier Detection",
        cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)

[Figure: boxplot of LSTAT with outliers flagged]

#cut-off values are given by the formal definition of an outlier: 
#Q3 + 1.5*IQR
upper_cut_off1 <- (quantile(lstat, 0.75)) + (IQR(lstat))*1.5
upper_cut_off1
##     75% 
## 31.9625
#Q1 - 1.5*IQR
lower_cut_off1 <- (quantile(lstat, 0.25)) - (IQR(lstat))*1.5
lower_cut_off1
##     25% 
## -8.0575
boxplot(medv, boxfill = "light gray", outpch = 21:25, outlty = 2,
        bg = "red", lwd = 2, medcol = "dark blue", medcex = 2, medpch = 20,
        ylim = c(0, 50), range = 1.5,
        main = "Dispersion in MEDV Variable w/ Outlier Detection",
        cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)

[Figure: boxplot of MEDV with outliers flagged]

#cut-off values are given by the formal definition of an outlier: 
#Q3 + 1.5*IQR
upper_cut_off <- (quantile(medv, 0.75)) + (IQR(medv))*1.5
upper_cut_off
##     75% 
## 36.9625
#Q1 - 1.5*IQR
lower_cut_off <- (quantile(medv, 0.25)) - (IQR(medv))*1.5
lower_cut_off
##    25% 
## 5.0625
#identifying outlier values for identification in scatter plot
outliers_lstat = boxplot(lstat, plot=FALSE)$out
outliers_medv = boxplot(medv, plot=FALSE)$out


#MEDV on the y-axis so the line from lm(medv ~ lstat) overlays correctly
plot(lstat, medv,
     col = ifelse(medv %in% outliers_medv, "red",
                  ifelse(lstat %in% outliers_lstat, "green", "blue")),
     xlab = "% of Lower Status in the Population",
     ylab = "Median Home Value",
     main = "MEDV vs. LSTAT w/ Outlier Detection", type = "p",
     ylim = c(0, 50), lwd = 2, cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)
abline(lm(medv ~ lstat), lwd = 2)

[Figure: scatterplot of MEDV vs. LSTAT with outliers highlighted and fitted OLS line]

The Effect of Outliers

If we were to model “median home value” as a function of “% of lower status in the population” using OLS, our results would be affected by the outliers highlighted in the scatter plot above. The method of least squares minimizes the sum of the squared vertical distances between each data point and the fitted line, so a single extreme point contributes disproportionately: the line of best fit is pulled toward the outliers and the resulting coefficient estimates are distorted.
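One way to quantify this (a hedged aside, not required by the question) is to refit the simple regression with the flagged points excluded and compare the coefficients; the code reuses outliers_lstat and outliers_medv computed above on the attached Boston data.

flagged <- (medv %in% outliers_medv) | (lstat %in% outliers_lstat)
coef(lm(medv ~ lstat))                      # fit on all 506 zones
coef(lm(medv ~ lstat, subset = !flagged))   # fit with outliers excluded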



(b) Let us try to fit an MLR to this dataset, with MEDVDerived as the dependent variable. Keep the first 300 records as a training set (call it Bostrain) which you will use to fit the model; the remaining 206 will be used as a test set (Bostest). Use only the following variables (expressed in “R” form for convenience) in your model: MEDVDerived ~ LSTAT + RM + CHAS + INDUS + TAX + RAD + BLACK.

Bostrain <- Boston[1:300,]
Bostest <- Boston[301:506,]
names(Boston)
##  [1] "crim"        "zn"          "indus"       "chas"        "nox"        
##  [6] "rm"          "age"         "dis"         "rad"         "tax"        
## [11] "ptratio"     "black"       "lstat"       "medv"        "medvDerived"
lm.fit=lm(medvDerived ~ lstat + rm + chas + indus + tax + rad + black, data = Bostrain)
summary(lm.fit)
## 
## Call:
## lm(formula = medvDerived ~ lstat + rm + chas + indus + tax + 
##     rad + black, data = Bostrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36660 -0.07622 -0.00430  0.07071  0.56637 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.4148850  0.1407009  10.056  < 2e-16 ***
## lstat       -0.0153768  0.0018120  -8.486 1.09e-15 ***
## rm           0.2917835  0.0150922  19.333  < 2e-16 ***
## chas         0.0612613  0.0269690   2.272  0.02384 *  
## indus        0.0013798  0.0015540   0.888  0.37533    
## tax         -0.0005450  0.0001353  -4.027 7.20e-05 ***
## rad          0.0021081  0.0050611   0.417  0.67733    
## black        0.0005371  0.0001855   2.894  0.00408 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1304 on 292 degrees of freedom
## Multiple R-squared:  0.8401, Adjusted R-squared:  0.8362 
## F-statistic: 219.1 on 7 and 292 DF,  p-value: < 2.2e-16



(a) Report the coefficients obtained by your model. Would you prefer to drop any of the variables used in your model (based on the t-scores or p-values)?

Yes. Based on their t-statistics and p-values, indus (p = 0.375) and rad (p = 0.677) are not statistically significant at the 5% level, so I would drop both and refit the model:

lm.fit=lm(medvDerived ~ lstat + rm + chas + tax + black, data = Bostrain)
MSE = mean(lm.fit$residuals^2)
summary(lm.fit)
## 
## Call:
## lm(formula = medvDerived ~ lstat + rm + chas + tax + black, data = Bostrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36912 -0.07477  0.00037  0.07206  0.56043 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.4217911  0.1401846  10.142  < 2e-16 ***
## lstat       -0.0148977  0.0017241  -8.641 3.64e-16 ***
## rm           0.2917394  0.0149576  19.504  < 2e-16 ***
## chas         0.0656851  0.0264573   2.483  0.01360 *  
## tax         -0.0004837  0.0001160  -4.171 4.00e-05 ***
## black        0.0005117  0.0001830   2.796  0.00551 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1301 on 294 degrees of freedom
## Multiple R-squared:  0.8396, Adjusted R-squared:  0.8369 
## F-statistic: 307.8 on 5 and 294 DF,  p-value: < 2.2e-16
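As a hedged cross-check (not required by the assignment), a nested-model F-test confirms that the two dropped predictors add little; lm.fit here is the reduced model fitted above.

fit.full <- lm(medvDerived ~ lstat + rm + chas + indus + tax + rad + black,
               data = Bostrain)
anova(lm.fit, fit.full)   # a large p-value means the reduced model suffices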


(b) Report the MSE obtained on Bostrain. How much does this increase when you score your model on Bostest?

MSE
## [1] 0.01659244
#The training MSE on Bostrain is shown above; now score the model on Bostest
test.pred = predict(lm.fit, Bostest)
MSE1 = mean((Bostest$medvDerived-test.pred)^2)
MSE1
## [1] 0.1119954

The test MSE (0.112) is roughly 6.7 times the training MSE (0.0166), an increase of about 0.095. Training error is an optimistic estimate of out-of-sample error.
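A hedged follow-up (the fold count and seed are arbitrary choices): 5-fold cross-validation on the training rows gives a less optimistic error estimate than the raw training MSE, without touching Bostest.

set.seed(3)
folds <- sample(rep(1:5, length.out = nrow(Bostrain)))
cv.mse <- sapply(1:5, function(f) {
  fit <- lm(medvDerived ~ lstat + rm + chas + tax + black,
            data = Bostrain[folds != f, ])
  mean((Bostrain$medvDerived[folds == f] -
          predict(fit, Bostrain[folds == f, ]))^2)
})
mean(cv.mse)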


4. (3+2+2 =7 pts) Robust Regression

Consider the dataset provided (rrdata.RData), where one is interested in predicting crimes per 100,000 people (crime), given the percent of population living under poverty line (poverty), and percent of population that are single parents (single).

You are trying to predict “crime” based on only two independent variables, “poverty” and “single”.

(a) Try to predict “crime” using (i) an MLR (using the lm function), and then (ii) a robust regression using Huber loss on rrdata. For (ii), use the rlm function, available in the MASS package, a robust regression variant that uses Huber loss by default. Report the “Residual Standard Error” for both MLR and robust regression with Huber loss.

load(file="rrdata.RData")
names(rrdata)
## [1] "sid"      "state"    "crime"    "murder"   "pctmetro" "pctwhite"
## [7] "pcths"    "poverty"  "single"
#MLR model
crime.fit=lm(crime ~ poverty + single, data = rrdata)
summary(crime.fit)
## 
## Call:
## lm(formula = crime ~ poverty + single, data = rrdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -811.14 -114.27  -22.44  121.86  689.82 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1368.189    187.205  -7.308 2.48e-09 ***
## poverty         6.787      8.989   0.755    0.454    
## single        166.373     19.423   8.566 3.12e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 243.6 on 48 degrees of freedom
## Multiple R-squared:  0.7072, Adjusted R-squared:  0.695 
## F-statistic: 57.96 on 2 and 48 DF,  p-value: 1.578e-13

The Residual Standard Error in the MLR model is shown above: 243.6

library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked _by_ '.GlobalEnv':
## 
##     Boston
rr.huber <- rlm(crime ~ poverty + single, data = rrdata)
summary(rr.huber)
## 
## Call: rlm(formula = crime ~ poverty + single, data = rrdata)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -846.09 -125.80  -16.49  119.15  679.94 
## 
## Coefficients:
##             Value      Std. Error t value   
## (Intercept) -1423.0373   167.5899    -8.4912
## poverty         8.8677     8.0467     1.1020
## single        168.9858    17.3878     9.7186
## 
## Residual standard error: 181.8 on 48 degrees of freedom

As seen above, the Residual Standard Error decreased markedly under the more robust Huber loss model: 181.8, versus 243.6 for OLS.
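For intuition (an illustrative aside, not part of the assignment): the Huber loss is quadratic for small residuals and linear for large ones, so extreme points exert less pull than under squared loss. rlm's default Huber tuning constant is k = 1.345.

huber <- function(r, k = 1.345)
  ifelse(abs(r) <= k, 0.5 * r^2, k * abs(r) - 0.5 * k^2)

curve(0.5 * x^2, -4, 4, lty = 2, ylab = "loss",
      main = "Squared vs. Huber loss")   # squared loss (dashed)
curve(huber(x), add = TRUE, lwd = 2)     # Huber loss (solid)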


(b) Try to identify at most five outliers in this dataset so that if you remove these 5 points the error decreases the most. For example you can look at a plot of actual versus fitted values, or a plot of the residuals to help you determine these 5 points. Remove the outliers and call the resulting dataset rrdataNoOutlier. Fit MLR and robust regression on rrdataNoOutlier, and report residual standard errors.

MLR_outliers = sort(abs(resid(crime.fit)), decreasing=TRUE)[1:5]
MLR_outliers
##       25        9       51       46       26 
## 811.1373 689.8233 434.1663 415.7843 351.7679
outliers = as.numeric(names(MLR_outliers))   #25, 9, 51, 46, 26

plot(abs(resid(crime.fit)), main = "Outlier Detection for Residuals",
     ylab = "Absolute Residual", xlab = "Observation Index",
     col = ifelse(rrdata$sid %in% outliers, "dark blue", "dark red"),
     lwd = 3, pch = 4, cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)

[Figure: absolute residuals by observation, five largest flagged]

#creating rrdataNOoutlier
rrdataNOoutlier <- rrdata[-outliers,]
names(rrdataNOoutlier)
## [1] "sid"      "state"    "crime"    "murder"   "pctmetro" "pctwhite"
## [7] "pcths"    "poverty"  "single"
MLR.fit = lm(crime ~ poverty + single, data = rrdataNOoutlier)
summary(MLR.fit)
## 
## Call:
## lm(formula = crime ~ poverty + single, data = rrdataNOoutlier)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -355.35 -114.16  -30.70   96.06  321.65 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1168.379    192.998  -6.054 3.05e-07 ***
## poverty         8.500      6.521   1.303    0.199    
## single        147.054     18.526   7.938 5.81e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 166.7 on 43 degrees of freedom
## Multiple R-squared:  0.6592, Adjusted R-squared:  0.6433 
## F-statistic: 41.58 on 2 and 43 DF,  p-value: 8.901e-11

Removing the outliers decreases the Residual Standard Error relative to the first MLR model: 166.7, down from 243.6.

rr.huber2 <- rlm(crime ~ poverty + single, data = rrdataNOoutlier)
summary(rr.huber2)
## 
## Call: rlm(formula = crime ~ poverty + single, data = rrdataNOoutlier)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -348.79 -106.77  -23.25  101.04  329.26 
## 
## Coefficients:
##             Value      Std. Error t value   
## (Intercept) -1155.2532   211.2772    -5.4679
## poverty         8.8081     7.1391     1.2338
## single        144.8869    20.2806     7.1441
## 
## Residual standard error: 158 on 43 degrees of freedom

Again, we see a further decrease in the Residual Standard Error relative to the first Huber loss model: 158, down from 181.8.


(c) Compare and comment on the model fits (plot actual vs. predicted values) obtained in (a) and (b). How do outliers affect the relative performance of MLR and robust regression with Huber loss?

#actual vs. predicted values for MLR model w/ outliers 
plot(predict(crime.fit), rrdata$crime,
     col = ifelse(rrdata$sid %in% outliers, "blue", "dark orange"), lwd = 3,
     main = "Actual vs. Predicted Crime Rates for MLR Model",
     sub = "With Outliers", xlab = "Predicted Crime", ylab = "Actual Crime",
     cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)

[Figure: actual vs. predicted crime, MLR with outliers]

#same with no outliers for MLR (using rrdataNOoutlier data set)
plot(predict(MLR.fit), rrdataNOoutlier$crime, col = "dark orange", lwd = 3,
     main = "Actual vs. Predicted Crime Rates for MLR Model",
     sub = "Without Outliers", xlab = "Predicted Crime", ylab = "Actual Crime",
     cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)

[Figure: actual vs. predicted crime, MLR without outliers]

#actual vs. predicted values for Huber model w/ outliers 
plot(predict(rr.huber), rrdata$crime,
     col = ifelse(rrdata$sid %in% outliers, "dark green", "dark red"), lwd = 3,
     main = "Actual vs. Predicted Crime Rates for Huber Loss Model",
     sub = "With Outliers", xlab = "Predicted Crime", ylab = "Actual Crime",
     cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)

[Figure: actual vs. predicted crime, Huber with outliers]

#same with no outliers for Huber Loss model (using rrdataNOoutlier data set)
plot(predict(rr.huber2), rrdataNOoutlier$crime, col = "dark red", lwd = 3,
     main = "Actual vs. Predicted Crime Rates for Huber Loss Model",
     sub = "Without Outliers", xlab = "Predicted Crime", ylab = "Actual Crime",
     cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)

[Figure: actual vs. predicted crime, Huber without outliers]


Outliers clearly affect the relative performance of ordinary MLR and robust methods such as Huber loss regression. Both models improved on the “rrdataNOoutlier” data set, as seen in the lower Residual Standard Errors, but the improvement was smaller for the Huber fit (181.8 to 158) than for plain MLR (243.6 to 166.7). That is exactly what we expect: Huber loss down-weights large residuals in the first place, so it had less to gain from removing the outliers. The coefficient comparison below makes the same point.
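A compact way to see this (reusing the four fits above): compare the coefficients side by side to see how much each fit moves when the five outliers are removed.

round(cbind(MLR      = coef(crime.fit),
            MLR.no   = coef(MLR.fit),
            Huber    = coef(rr.huber),
            Huber.no = coef(rr.huber2)), 2)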