Suppose I use a validation set to determine that K=13 is a good choice when applying the K-Nearest
Neighbor (KNN) Classifier to a given classification problem.
a. What are two main uses of a validation or holdout dataset. (answer should not include determining
K for a KNN classifier!)
The two main uses of a validation or holdout dataset are:
b. What is one principal drawback of the K-Nearest Neighbor (KNN) Classifier?
The principal drawback of the K-Nearest Neighbor Classifier is that it does not generalize well and is not robust in the face of noisy data. This is because the KNN algorithm uses all features equally in computing for similarities and assigning classes. When there is only a small subset of features that are useful for classification in the first place, this can lead to an increase in classification errors over more robust methods of classification.
Additionally, as discussed in class, the KNN approach is highly susceptible to “The Curse of Dimensionality” (i.e., Euclidean distances become very noisy for high dimensional problems).
c. If you used K=3 instead of 13 as determined for this problem, then would you more likely be
overfitting or underfitting the data? Briefly justify your choice.
K=3 simply means that a new case (data point) is assigned to its 3 nearest neighbors. Likewise, when K=13, a new case will be classified based on the classes of its 13 nearest neighbors. Both intuition and experience tell us that K=3 will most likely overfit the data, creating a more complex model while chasing patterns in the data that will not generalize well. While the training error rate might look good, the inability to generalize well will mostly likely result in a much higher test error rate. K=13 is a better choice although, in some cases, it may underfit the data and introduce bias into the model.
In general, increasing k will decrease variance and increase bias. While decreasing k will increase variance and decrease bias. This is what the bias-variance tradeoff is all about!
Suppose a manager at an internet sales company wants to estimate how fast his salesperson is generating
successful leads. Instead of recording the time for each lead, the time taken to generate the next 5
leads are recorded, i.e., there is one recording (denoting the elapsed time) for every 5 consecutive
leads. For a specific salesperson, the time intervals recorded are {1, 3, 1.5, 4, 2, 7, 1.2, 2, 4, 3.1} hours.
A statistician suggests that if these time intervals are assumed to arise by i.i.d. sampling from the
following distribution:
(where C is a normalizing constant), and if θ can be estimated, then he can provide detailed information
about the lead generation process, including average rates, variances etc.
Find the Maximum Likelihood estimate for θ based on the recorded observations.
In general, the method of maximum likelihood selects the set of values of the model parameters that maximizes the “agreement” of the selected model with the observed data. As we discussed in class, the criteria that maximizes the likelihood of a distribution is the sample mean and sample variance. Thus, using the data listed above, the following equation will attempt to estimate theta.
Note: The equation below is abbreviated to illustrate the process of how to calculate MLE and avoid taking up the entire page by repetitively plugging in all values of X (inputs).
Consider the dataset provided (BostonDerived.RData) which is derived from the “Boston Housing”
data set, part of the MASS package. This dataset records properties of 506 housing zones in the Greater
Boston area. For a description of the original data, see http://archive.ics.uci.edu/ml/datasets/Housing.
Typically one is interested in predicting MEDV (median home value) based on other attributes.
(a) Generate box-plots of the LSTAT (% of lower status in the population) and MEDV (median home
value) attributes and identify the cutoff values for outliers. Generate a scatterplot of MEDV
against LSTAT; comment on how inclusion of the outliers would affect a predictive model of
median home value as a function of % of lower status in the population. (Hint: Such effects may
be easier to visualize if the outliers are a different color or symbol than the other data.)
load(file="BostonDerived.RData")
attach(Boston)
boxplot(lstat, boxfill = "light gray", outpch = 21:25, outlty = 2, bg = "pink", lwd = 2, medcol = "dark green", medcex = 2, medpch = 20, ylim = c(0, 50), main = "Dispersion in LSTAT Variable w/ Outlier Detection", range = 1.5, cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)
#cut-off values are given by the formal definition of an outlier:
#Q3 + 1.5*IQR
upper_cut_off1 <- (quantile(lstat, 0.75)) + (IQR(lstat))*1.5
upper_cut_off1
## 75%
## 31.9625
#Q1 - 1.5*IQR
lower_cut_off1 <- (quantile(lstat, 0.25)) - (IQR(lstat))*1.5
lower_cut_off1
## 25%
## -8.0575
boxplot(medv, boxfill = "light gray", outpch = 21:25, outlty = 2, bg = "red", lwd = 2, medcol = "dark blue", medcex = 2, medpch = 20, ylim = c(0, 50), main = "Dispersion in MEDV Variable w/ Outlier Detection", range = 1.5, cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)
#cut-off values are given by the formal definition of an outlier:
#Q3 + 1.5*IQR
upper_cut_off <- (quantile(medv, 0.75)) + (IQR(medv))*1.5
upper_cut_off
## 75%
## 36.9625
#Q1 - 1.5*IQR
lower_cut_off <- (quantile(medv, 0.25)) - (IQR(medv))*1.5
lower_cut_off
## 25%
## 5.0625
#identifying outlier values for identification in scatter plot
outliers_lstat = boxplot(lstat, plot=FALSE)$out
outliers_medv = boxplot(medv, plot=FALSE)$out
plot(medv, lstat, col = ifelse(medv %in% outliers_medv, "red", ifelse(lstat %in% outliers_lstat, "green", "blue")), xlab = "Median Home Value", ylab = "% of Lower Status in the Population", main = "MEDV vs. LSTAT w/ Outlier Detection", type = "p", ylim = c(0, 50), lwd = 2, cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)
abline(lm(medv ~ lstat), lwd = 2)
If we were to model “median home value” as a function of “% of lower status in the population” using OLS, our results would be affected by the outliers highlighted in the scatter plot above. This is because the method of least squares involves minimizing the sum of the squared vertical distances between each data point and the fitted line. If a model is fitted on data containing outliers, the line of best fit will be erroneously pulled towards the outliers and the resulting coefficient estimates will be off.
(b) Let us try to fit an MLR to this dataset, with MEDVDerived as the dependent variable. Keep
the first 300 records as a training set (call it Bostrain) which you will use to fit the model; the
remaining 206 will be used as a test set (Bostest). Use only the following variables (expressed in “R” form for convenience) in your model: MEDV Derived ∼ LSTAT + RM + CHAS + INDUS + TAX + RAD + BLACK.
Bostrain <- Boston[1:300,]
Bostest <- Boston[301:506,]
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox"
## [6] "rm" "age" "dis" "rad" "tax"
## [11] "ptratio" "black" "lstat" "medv" "medvDerived"
lm.fit=lm(medvDerived ~ lstat + rm + chas + indus + tax + rad + black, data = Bostrain)
summary(lm.fit)
##
## Call:
## lm(formula = medvDerived ~ lstat + rm + chas + indus + tax +
## rad + black, data = Bostrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36660 -0.07622 -0.00430 0.07071 0.56637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4148850 0.1407009 10.056 < 2e-16 ***
## lstat -0.0153768 0.0018120 -8.486 1.09e-15 ***
## rm 0.2917835 0.0150922 19.333 < 2e-16 ***
## chas 0.0612613 0.0269690 2.272 0.02384 *
## indus 0.0013798 0.0015540 0.888 0.37533
## tax -0.0005450 0.0001353 -4.027 7.20e-05 ***
## rad 0.0021081 0.0050611 0.417 0.67733
## black 0.0005371 0.0001855 2.894 0.00408 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1304 on 292 degrees of freedom
## Multiple R-squared: 0.8401, Adjusted R-squared: 0.8362
## F-statistic: 219.1 on 7 and 292 DF, p-value: < 2.2e-16
(a) Report the coefficients obtained by your model. Would you prefer to drop any of the variables
used in your model (based on the t-scores or p-values)?
Yes, I would prefer to drop the variables that are not deemed statistically significant at the 95% confidence level (indus, rad), as evidenced by both their t-scores and corresponding p-values.
lm.fit=lm(medvDerived ~ lstat + rm + chas + tax + black, data = Bostrain)
MSE = mean(lm.fit$residuals^2)
summary(lm.fit)
##
## Call:
## lm(formula = medvDerived ~ lstat + rm + chas + tax + black, data = Bostrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36912 -0.07477 0.00037 0.07206 0.56043
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4217911 0.1401846 10.142 < 2e-16 ***
## lstat -0.0148977 0.0017241 -8.641 3.64e-16 ***
## rm 0.2917394 0.0149576 19.504 < 2e-16 ***
## chas 0.0656851 0.0264573 2.483 0.01360 *
## tax -0.0004837 0.0001160 -4.171 4.00e-05 ***
## black 0.0005117 0.0001830 2.796 0.00551 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1301 on 294 degrees of freedom
## Multiple R-squared: 0.8396, Adjusted R-squared: 0.8369
## F-statistic: 307.8 on 5 and 294 DF, p-value: < 2.2e-16
(b) Report the MSE obtained on Bostrain. How much does this increase when you score your
model on Bostest?
MSE
## [1] 0.01659244
#Report the MSE obtained on Bostrain
test.pred = predict(lm.fit, Bostest)
MSE1 = mean((Bostest$medvDerived-test.pred)^2)
MSE1
## [1] 0.1119954
The increase observed in the test MSE is shown above.
Consider the dataset provided (rrdata.RData), where one is interested in predicting crimes per 100,000
people (crime), given the percent of population living under poverty line (poverty), and percent of
population that are single parents (single).
You are trying to predict “crime” based on only two independent variables, “poverty” and “single”.
(a) Try to predict ”crime” using (i) an MLR (using lm function), and then (ii) a robust regression
using Huber loss on rrdata. For (ii), use the rlm function, available in MASS package, for robust
regression variant that uses Huber loss as follows. Report “Residual Standard Error” for both MLR and robust regression with Huber loss.
load(file="rrdata.RData")
names(rrdata)
## [1] "sid" "state" "crime" "murder" "pctmetro" "pctwhite"
## [7] "pcths" "poverty" "single"
#MLR model
crime.fit=lm(crime ~ poverty + single, data = rrdata)
summary(crime.fit)
##
## Call:
## lm(formula = crime ~ poverty + single, data = rrdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -811.14 -114.27 -22.44 121.86 689.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1368.189 187.205 -7.308 2.48e-09 ***
## poverty 6.787 8.989 0.755 0.454
## single 166.373 19.423 8.566 3.12e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 243.6 on 48 degrees of freedom
## Multiple R-squared: 0.7072, Adjusted R-squared: 0.695
## F-statistic: 57.96 on 2 and 48 DF, p-value: 1.578e-13
The Residual Standard Error in the MLR model is shown above: 243.6
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked _by_ '.GlobalEnv':
##
## Boston
rr.huber <- rlm(crime ~ poverty + single, data = rrdata)
summary(rr.huber)
##
## Call: rlm(formula = crime ~ poverty + single, data = rrdata)
## Residuals:
## Min 1Q Median 3Q Max
## -846.09 -125.80 -16.49 119.15 679.94
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -1423.0373 167.5899 -8.4912
## poverty 8.8677 8.0467 1.1020
## single 168.9858 17.3878 9.7186
##
## Residual standard error: 181.8 on 48 degrees of freedom
As seen above, the Residual Standard Error significantly decreased with the more robust Huber loss model: 181.8
(b) Try to identify at most five outliers in this dataset so that if you remove these 5 points the
error decreases the most. For example you can look at a plot of actual versus fitted values, or
a plot of the residuals to help you determine these 5 points. Remove the outliers and call the
resulting dataset rrdataNoOutlier. Fit MLR and robust regression on rrdataNoOutlier, and
report residual standard errors.
MLR_outliers = sort(abs(resid(crime.fit)), decreasing=TRUE)[1:5]
MLR_outliers
## 25 9 51 46 26
## 811.1373 689.8233 434.1663 415.7843 351.7679
outliers = c(25,9,51,46,26)
plot(abs(resid(crime.fit)), main="Outlier Detection for Residuals", ylab="Residual Standard Error", xlab = "X Values", col=ifelse(rrdata$sid %in% outliers,"dark blue","dark red"), lwd = 3, pch = 4, cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)
#creating rrdataNOoutlier
rrdataNOoutlier <- rrdata[-outliers,]
names(rrdataNOoutlier)
## [1] "sid" "state" "crime" "murder" "pctmetro" "pctwhite"
## [7] "pcths" "poverty" "single"
MLR.fit = lm(crime ~ poverty + single, data = rrdataNOoutlier)
summary(MLR.fit)
##
## Call:
## lm(formula = crime ~ poverty + single, data = rrdataNOoutlier)
##
## Residuals:
## Min 1Q Median 3Q Max
## -355.35 -114.16 -30.70 96.06 321.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1168.379 192.998 -6.054 3.05e-07 ***
## poverty 8.500 6.521 1.303 0.199
## single 147.054 18.526 7.938 5.81e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 166.7 on 43 degrees of freedom
## Multiple R-squared: 0.6592, Adjusted R-squared: 0.6433
## F-statistic: 41.58 on 2 and 43 DF, p-value: 8.901e-11
We can see an decrease in the Residual Standard Error from the first MLR model above: 166.7
rr.huber2 <- rlm(crime ~ poverty + single, data = rrdataNOoutlier)
summary(rr.huber2)
##
## Call: rlm(formula = crime ~ poverty + single, data = rrdataNOoutlier)
## Residuals:
## Min 1Q Median 3Q Max
## -348.79 -106.77 -23.25 101.04 329.26
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -1155.2532 211.2772 -5.4679
## poverty 8.8081 7.1391 1.2338
## single 144.8869 20.2806 7.1441
##
## Residual standard error: 158 on 43 degrees of freedom
Again, we can see another decrease in the Residual Standard Error from the first Huber loss model: 158
C. Compare and comment on the model fits (plot actual vs. predicted values) obtained in (a) and
(b). How do outliers affect the relative performance of MLR and robust regression with Huber
loss?
#actual vs. predicted values for MLR model w/ outliers
plot(predict(crime.fit), rrdata$crime, col=ifelse(rrdata$sid %in% outliers,"blue","dark orange"), lwd = 3, main = "Actual vs. Predicted Crime Rates for MLR Model", sub = "With Outliers", xlab = "Predicted Crime", ylab = "Actual Crime", cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)
#same with no outliers for MLR (using rrdataNOoutlier data set)
plot(predict(MLR.fit), rrdataNOoutlier$crime, col="dark orange", lwd = 3, main = "Actual vs. Predicted Crime Rates for MLR Model", sub = "Without Outliers", xlab = "Predicted Crime", ylab = "Actual Crime", cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)
#actual vs. predicted values for Huber model w/ outliers
plot(predict(rr.huber), rrdata$crime, col=ifelse(rrdata$sid %in% outliers,"dark green","dark red"), lwd = 3, main = "Actual vs. Predicted Crime Rates for Huber Loss Model", sub = "With Outliers", xlab = "Predicted Crime", ylab = "Actual Crime", cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)
#same with no outliers for Huber Loss model (using rrdataNOoutlier data set)
plot(predict(rr.huber2), rrdataNOoutlier$crime, col="dark red", lwd = 3, main = "Actual vs. Predicted Crime Rates for Huber Loss Model", sub = "Without Outliers", xlab = "Predicted Crime", ylab = "Actual Crime", cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)
It's clear that outliers affect the relative performance of regular MLR and more robust methods such as Huber's Loss model. Performance significantly improved as observed through the decrease in Residual Standard Error scores for both models using the “rrdataNOoutlier” data set. There was less of an improvement in the RSE metric with Huber's Loss model. However, that is because Huber's Loss is more robust to the effect of outliers in the first place.