Compare the classification performance of linear regression and k-nearest neighbor classification on the zipcode data. In particular, consider only the 2's and 3's, and k = 1, 3, 5, 7 and 15. Show both the training and test error for each choice.
Reading the train and test files for independent and dependent variables
# Training set: column 1 holds the digit label, the remaining columns the pixel values
X <- as.matrix(read.table(gzfile("zip.train.gz")))
y2or3 <- which(X[, 1] == 2 | X[, 1] == 3)  # keep only the 2's and 3's
X.train <- X[y2or3, -1]
y.train <- X[y2or3, 1] == 3                # TRUE for a 3, FALSE for a 2
# Test set, filtered in the same way
X <- as.matrix(read.table(gzfile("zip.test.gz")))
y2or3 <- which(X[, 1] == 2 | X[, 1] == 3)
X.test <- X[y2or3, -1]
y.test <- X[y2or3, 1] == 3
Classification by linear regression
## [1] 0.04120879
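The code that produced the error rate above is not shown. A minimal sketch of classification by linear regression, assuming the usual approach of regressing the 0/1 class indicator on the predictors and thresholding the fitted values at 0.5, could look like this. The helper name `lm_class_error` and the simulated toy data are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: fit a linear regression to a TRUE/FALSE response and
# return the misclassification rate on (X.new, y.new).
lm_class_error <- function(X.train, y.train, X.new, y.new) {
  train <- data.frame(y = as.numeric(y.train), X.train)
  fit <- lm(y ~ ., data = train)
  # Classify as TRUE when the fitted value exceeds 0.5
  y.hat <- predict(fit, newdata = data.frame(X.new)) > 0.5
  mean(y.hat != y.new)
}

# Simulated stand-in for the zipcode data (assumption): two shifted classes
set.seed(1)
n <- 200
y.toy <- rep(c(FALSE, TRUE), each = n / 2)
X.toy <- matrix(rnorm(n * 2), ncol = 2) + 3 * y.toy
idx <- seq(1, n, by = 2)  # odd rows for training, even rows for testing

err.train <- lm_class_error(X.toy[idx, ], y.toy[idx], X.toy[idx, ], y.toy[idx])
err.test  <- lm_class_error(X.toy[idx, ], y.toy[idx], X.toy[-idx, ], y.toy[-idx])
print(c(train = err.train, test = err.test))
```

With the objects created in the loading step, the same helper would be called as `lm_class_error(X.train, y.train, X.test, y.test)` for the test error and with the training set in both positions for the training error.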
Classification by k-nearest neighbors
## Error Rate
## Linear Regression 0.04120879
## k-NN with k = 1 0.02472527
## k-NN with k = 3 0.03021978
## k-NN with k = 5 0.03021978
## k-NN with k = 7 0.03296703
## k-NN with k = 15 0.03846154
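The exercise asks for both training and test error, so a sketch that computes the two side by side for each k may be useful. It assumes the `knn()` function from the `class` package (the original code is not shown, so this is a guess at the tooling); `knn_errors` and the simulated toy data are hypothetical.

```r
library(class)  # provides knn(); assumed here, the source does not name a package

# Hypothetical helper: training and test misclassification rate for each k
knn_errors <- function(X.train, y.train, X.test, y.test,
                       ks = c(1, 3, 5, 7, 15)) {
  cl <- factor(y.train)
  err <- function(pred, truth) mean(as.character(pred) != as.character(truth))
  t(sapply(ks, function(k) {
    c(k = k,
      train = err(knn(X.train, X.train, cl, k = k), y.train),
      test  = err(knn(X.train, X.test,  cl, k = k), y.test))
  }))
}

# Simulated stand-in for the zipcode data (assumption)
set.seed(1)
n <- 200
y.toy <- rep(c(FALSE, TRUE), each = n / 2)
X.toy <- matrix(rnorm(n * 2), ncol = 2) + 3 * y.toy
idx <- seq(1, n, by = 2)

res <- knn_errors(X.toy[idx, ], y.toy[idx], X.toy[-idx, ], y.toy[-idx])
print(res)
```

With the real data the call would be `knn_errors(X.train, y.train, X.test, y.test)`. For k = 1 the training error is zero whenever no two training points coincide, because each point is its own nearest neighbor, which is why the test error is the more informative column.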
Comparing the results using plots
Comments:
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
A very flexible approach can capture complex, non-linear relationships and so has lower bias, but it has higher variance, can overfit the training data and is harder to interpret. A more flexible approach would therefore be preferred when we are interested in prediction rather than the interpretability of the results. [More useful for prediction than inference]
A less flexible approach would be preferred when we are interested in inference and the interpretability of the results.
The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.
Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.
(5.a.) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
## X1 X2 X3 Y
## 1 0 3 0 R
## 2 2 0 0 R
## 3 0 1 3 R
## 4 0 1 2 G
## 5 -1 0 1 G
## 6 1 1 1 R
## X1 X2 X3 Y Euclidean_Distance
## a 0 3 0 R 3.000000
## b 2 0 0 R 2.000000
## c 0 1 3 R 3.162278
## d 0 1 2 G 2.236068
## e -1 0 1 G 1.414214
## f 1 1 1 R 1.732051
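The distances in the table above can be reproduced with a few lines of R. This is a sketch of one way to do it; the object names `train`, `x0` and `d` are illustrative.

```r
# Training data from the table above
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = factor(c("R", "R", "R", "G", "G", "R"))
)
x0 <- c(0, 0, 0)  # test point X1 = X2 = X3 = 0

# Euclidean distance: subtract the test point from every row, square, sum, root
d <- sqrt(rowSums(sweep(as.matrix(train[, 1:3]), 2, x0)^2))
print(cbind(train, Euclidean_Distance = d))
```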
## [1] G
## Levels: G R
Answer:
With K = 1 the prediction is the class of the single nearest neighbor. The observation closest to the test point X1 = X2 = X3 = 0 is observation 5 (Green), at a distance of sqrt(2) = 1.414. Thus, the prediction will be Green.
## [1] R
## Levels: G R
Answer:
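With K = 3 the three nearest neighbors of the test point X1 = X2 = X3 = 0 are Green (Obs 5, distance 1.414), Red (Obs 6, distance 1.732) and Red (Obs 2, distance 2.000). Two of the three are Red, so the prediction will be Red.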
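Both predictions follow from a majority vote among the K nearest observations. A small sketch of that vote is given below; `knn_vote` is a hypothetical helper written for this illustration, not a library function.

```r
# Class labels and distances to the test point, taken from the table above
Y <- factor(c("R", "R", "R", "G", "G", "R"))
d <- c(3, 2, sqrt(10), sqrt(5), sqrt(2), sqrt(3))

# Hypothetical helper: majority class among the k nearest observations
knn_vote <- function(d, labels, k) {
  nearest <- order(d)[seq_len(k)]   # indices of the k smallest distances
  votes <- table(labels[nearest])   # count of each class among them
  names(votes)[which.max(votes)]
}

pred.k1 <- knn_vote(d, Y, 1)
pred.k3 <- knn_vote(d, Y, 3)
print(c(K1 = pred.k1, K3 = pred.k3))
```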
If the Bayes decision boundary is highly non-linear, we would expect the best value of K to be small. A small K gives a flexible decision boundary that can follow a non-linear shape, whereas a large K averages over many neighbors and produces a smoother, more nearly linear boundary.
## [1] 506 14
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Answer:
There are 506 rows and 14 columns. The rows represent the observations, a total of 506 cases. The columns represent the 14 attributes that describe the data.
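The output above matches what the following commands return. This sketch assumes the data set is `Boston` from the `MASS` package, which is consistent with the 506 x 14 dimensions and the column names shown, although the source does not name the package.

```r
library(MASS)  # assumption: Boston is taken from MASS

boston.dim <- dim(Boston)      # number of rows and columns
boston.names <- names(Boston)  # attribute names
print(boston.dim)
print(summary(Boston))         # five-number summary and mean per column
print(boston.names)
```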
Answer:
Answer: The following predictors appear to be associated with the per capita crime rate:
rad: index of accessibility to radial highways. There appears to be a positive correlation with crime rate.
tax: full-value property-tax rate per $10,000. There appears to be a positive correlation with crime rate.
lstat: lower status of the population (percent). There appears to be a positive correlation with crime rate.
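The positive associations noted above can be checked numerically with sample correlations. This is a sketch assuming `Boston` from the `MASS` package; the original conclusions may have been read off scatterplots instead.

```r
library(MASS)  # assumption: Boston is taken from MASS

# Correlation of the crime rate with each of the three predictors
r <- sapply(Boston[, c("rad", "tax", "lstat")],
            function(v) cor(Boston$crim, v))
print(round(r, 3))
```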
We can answer this question by performing an outlier analysis for each of these attributes [here, data points lying more than 1.5 times the interquartile range below the first quartile or above the third quartile are treated as outliers].
Answer:
1. Crime Rate: The per capita crime rate by town. This has many outliers, as seen in the boxplot. Most towns have a crime rate close to zero (under 5), while a few suburbs have rates as high as 70 to 89. A sizeable minority of towns lies in this upper tail (e.g. 54 towns have crim > 10).
2. Tax Rate: The full-value property-tax rate per $10,000. The rate ranges from 187 to 711, with no extreme outliers. The median is 330.
3. Pupil-Teacher Ratios: The pupil-teacher ratio by town. The ratio ranges from 12.6 to 22. The interquartile range runs from 17.4 to 20.2, and there are a couple of non-extreme outliers at the low end, around 12.6.
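The 1.5 x IQR rule described above can be applied directly. The sketch below assumes `Boston` from the `MASS` package and uses a hypothetical helper `iqr_outliers`; it relies on R's default quantile definition, so counts may differ slightly from those implied by a boxplot.

```r
library(MASS)  # assumption: Boston is taken from MASS

# Hypothetical helper: values beyond 1.5 * IQR from the quartiles
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

# Number of outlying towns for each of the three attributes
n.out <- sapply(Boston[, c("crim", "tax", "ptratio")],
                function(x) length(iqr_outliers(x)))
print(n.out)
```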
The variable chas provides information about the Charles River (chas= 1 if tract bounds river; chas=0 otherwise).
##
## 0 1
## 471 35
Answer: 35 suburbs in this data set bound the Charles River.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
## [1] 19.05
Answer: The median pupil-teacher ratio among the towns in this data set is 19.05.
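The two results above (the number of tracts on the river and the median pupil-teacher ratio) can be obtained as follows. This sketch assumes `Boston` from the `MASS` package.

```r
library(MASS)  # assumption: Boston is taken from MASS

chas.counts <- table(Boston$chas)       # 0 = not on the river, 1 = bounds the river
ptratio.median <- median(Boston$ptratio)
print(chas.counts)
print(summary(Boston$ptratio))
print(ptratio.median)
```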
There are two suburbs which have the lowest median value of owner-occupied homes. These can be seen below: [The values of the other predictors for these suburbs are shown as well]
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1: 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.90
## 2: 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 384.97
## lstat medv
## 1: 30.59 5
## 2: 22.98 5
## [1] 399 406
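The row indices above can be recovered by comparing `medv` with its minimum. This sketch assumes `Boston` from the `MASS` package and prints a plain data frame rather than the table layout shown in the output.

```r
library(MASS)  # assumption: Boston is taken from MASS

# Rows whose median home value equals the smallest value in the data
lowest <- which(Boston$medv == min(Boston$medv))
print(Boston[lowest, ])
print(lowest)
```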
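The table below compares the two suburbs with the lowest median value of owner-occupied homes (rows 1 and 2) with summary statistics for all towns: the minimum, maximum, first quartile, third quartile, mean and median (rows 3 to 8, in that order). Their crime rates (38.35 and 67.92) are far above the third quartile of 3.68 and sit at the upper end of the range, although they are not the highest in the data (the maximum is 88.98).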
## crim zn indus chas nox rm age
## 1: 38.351800 0.00000 18.10000 0.00000 0.693000 5.453000 100.0000
## 2: 67.920800 0.00000 18.10000 0.00000 0.693000 5.683000 100.0000
## 3: 0.006320 0.00000 0.46000 0.00000 0.385000 3.561000 2.9000
## 4: 88.976200 100.00000 27.74000 1.00000 0.871000 8.780000 100.0000
## 5: 0.082045 0.00000 5.19000 0.00000 0.449000 5.885500 45.0250
## 6: 3.677082 12.50000 18.10000 0.00000 0.624000 6.623500 94.0750
## 7: 3.613524 11.36364 11.13678 0.06917 0.554695 6.284634 68.5749
## 8: 0.256510 0.00000 9.69000 0.00000 0.538000 6.208500 77.5000
## dis rad tax ptratio black lstat medv
## 1: 1.489600 24.000000 666.0000 20.20000 396.9000 30.59000 5.00000
## 2: 1.425400 24.000000 666.0000 20.20000 384.9700 22.98000 5.00000
## 3: 1.129600 1.000000 187.0000 12.60000 0.3200 1.73000 5.00000
## 4: 12.126500 24.000000 711.0000 22.00000 396.9000 37.97000 50.00000
## 5: 2.100175 4.000000 279.0000 17.40000 375.3775 6.95000 17.02500
## 6: 5.188425 24.000000 666.0000 20.20000 396.2250 16.95500 25.00000
## 7: 3.795043 9.549407 408.2372 18.45553 356.6740 12.65306 22.53281
## 8: 3.207450 5.000000 330.0000 19.05000 391.4400 11.36000 21.20000
Comments:
The numbers of suburbs that average more than seven rooms and more than eight rooms per dwelling are given below:
## More_than_7_rooms More_than_8_rooms
## [1,] 64 13
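These counts come from simple logical comparisons on `rm`. A sketch, assuming `Boston` from the `MASS` package:

```r
library(MASS)  # assumption: Boston is taken from MASS

# Summing a logical vector counts the TRUE values
room.counts <- c(More_than_7_rooms = sum(Boston$rm > 7),
                 More_than_8_rooms = sum(Boston$rm > 8))
print(room.counts)

# Summary of the suburbs averaging more than eight rooms per dwelling
print(summary(Boston[Boston$rm > 8, ]))
```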
Comments:
On analysing the dataset, the 13 suburbs that average more than eight rooms per dwelling have a low crime rate, generally a high median dwelling value (with one exception below the average) and generally a low pupil-teacher ratio. The full-value property-tax rate per $10,000 is low, which suggests that the tax rate is not tied to house value. In general there are not many non-retail businesses, suggesting that these are largely residential areas. Most of the houses are older, with a few suburbs having a very low proportion of old buildings. These suburbs seem to be further away from the highways than average, again suggesting residential areas, perhaps on the outskirts. The variables not mentioned above seem broadly consistent with the wider population.
This question involves the use of multiple linear regression on the Auto data set.
Correlations between the variables are given below:
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7762599 -0.8044430 NA -0.8317389
## cylinders -0.7762599 1.0000000 0.9509199 NA 0.8970169
## displacement -0.8044430 0.9509199 1.0000000 NA 0.9331044
## horsepower NA NA NA 1 NA
## weight -0.8317389 0.8970169 0.9331044 NA 1.0000000
## acceleration 0.4222974 -0.5040606 -0.5441618 NA -0.4195023
## year 0.5814695 -0.3467172 -0.3698041 NA -0.3079004
## origin 0.5636979 -0.5649716 -0.6106643 NA -0.5812652
## acceleration year origin
## mpg 0.4222974 0.5814695 0.5636979
## cylinders -0.5040606 -0.3467172 -0.5649716
## displacement -0.5441618 -0.3698041 -0.6106643
## horsepower NA NA NA
## weight -0.4195023 -0.3079004 -0.5812652
## acceleration 1.0000000 0.2829009 0.2100836
## year 0.2829009 1.0000000 0.1843141
## origin 0.2100836 0.1843141 1.0000000
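The NA entries in the horsepower row and column arise because `cor()` returns NA for any pair involving a variable with missing values, which is consistent with the five observations dropped for missingness in the regression output. A sketch of how to avoid this is shown below on a small made-up data frame (the values are illustrative, not taken from Auto).

```r
# Made-up toy data with one missing horsepower value (assumption)
toy <- data.frame(
  mpg        = c(18, 15, 16, 24, 22, 30),
  horsepower = c(130, 165, NA, 95, 97, 70),
  weight     = c(3504, 3693, 3436, 2372, 2833, 2130)
)

cor.default  <- cor(toy)                        # NA wherever horsepower is involved
cor.complete <- cor(toy, use = "complete.obs")  # drops rows with any NA first
print(cor.default)
print(cor.complete)
```

For the Auto data the analogous call would be `cor(Auto[, 1:8], use = "complete.obs")`, assuming the first eight columns are the numeric ones, as the later model call `Auto[, 1:8]` suggests.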
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Answer: This can be assessed by testing the null hypothesis that there is no relationship between the response and any of the predictors. The p-value corresponding to the F-statistic is less than 2.2e-16, which is very small, so we reject the null hypothesis: there is a relationship between “mpg” and the predictors.
We can answer this question by checking the p-values associated with each predictor's t-statistic. At the 5% level, ‘displacement’, ‘weight’, ‘year’ and ‘origin’ are statistically significant, while ‘cylinders’, ‘horsepower’ and ‘acceleration’ are not.
The coefficient for the ‘year’ variable is 0.750773. This suggests that, holding all other predictors constant, an increase of one year is associated with an average increase of 0.750773 in “mpg”.
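The quantities used in these answers (the overall F-test p-value and the per-coefficient p-values) can be pulled out of a fitted model programmatically. The sketch below uses the built-in `mtcars` data as a stand-in, since Auto requires an extra package; the model formula is an illustrative assumption.

```r
# Stand-in model on built-in data (assumption: mtcars replaces Auto here)
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

# Overall F-test: p-value from the F statistic and its degrees of freedom
f <- s$fstatistic
p.overall <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Terms whose t-test p-value is below 0.05
coefs <- coef(s)
signif.terms <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]

print(p.overall)
print(signif.terms)
```

For the Auto model in the output above, the same steps would start from `lm(mpg ~ . - name, data = Auto)`.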
Answer:
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement *
## weight, data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.3564 -2.4882 -0.3635 1.8469 17.8176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.285e+01 2.233e+00 23.673 < 2e-16 ***
## cylinders 7.580e-01 7.645e-01 0.992 0.322
## displacement -7.514e-02 1.669e-02 -4.502 8.90e-06 ***
## weight -9.931e-03 1.323e-03 -7.505 4.19e-13 ***
## cylinders:displacement -2.893e-03 3.424e-03 -0.845 0.399
## displacement:weight 2.147e-05 4.996e-06 4.298 2.18e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.115 on 391 degrees of freedom
## Multiple R-squared: 0.7269, Adjusted R-squared: 0.7234
## F-statistic: 208.2 on 5 and 391 DF, p-value: < 2.2e-16
Answer: The p-values indicate that the interaction between displacement and weight is statistically significant, while the interaction between cylinders and displacement is not.
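In an R formula, `a * b` expands to the two main effects plus their interaction `a:b`, which is why the output above contains three main effects and two interaction rows. A sketch on the built-in `mtcars` data (a stand-in for Auto, with analogous variables chosen for illustration):

```r
# cyl * disp + disp * wt expands to cyl + disp + wt + cyl:disp + disp:wt
fit.int <- lm(mpg ~ cyl * disp + disp * wt, data = mtcars)
coefs.int <- coef(summary(fit.int))

# p-values of the interaction terms only
p.int <- coefs.int[c("cyl:disp", "disp:wt"), "Pr(>|t|)"]
print(rownames(coefs.int))
print(p.int)
```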
Answer: The log transformation gives the most linear-looking plot for weight.
Answer: The square root transformation gives a more linear-looking plot for displacement.
Answer: For acceleration, all three plots remain widely scattered, so none of the transformations gives a clearly linear relationship.
Answer: The log transformation gives the most linear-looking plot for horsepower.
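The answers above were judged visually from scatterplots. As a numerical complement, one could compare the R-squared of a simple regression under each transformation; this is a swapped-in technique, not the method used above. The sketch runs on simulated data in which the true relationship is logarithmic (an assumption made so the example is self-contained).

```r
# Simulated data with a logarithmic relationship (assumption)
set.seed(42)
x <- seq(1, 100, length.out = 200)
y <- 3 * log(x) + rnorm(200, sd = 0.1)

# R-squared of y against x under each candidate transformation
r2 <- c(
  raw    = summary(lm(y ~ x))$r.squared,
  log    = summary(lm(y ~ log(x)))$r.squared,
  sqrt   = summary(lm(y ~ sqrt(x)))$r.squared,
  square = summary(lm(y ~ I(x^2)))$r.squared
)
best <- names(which.max(r2))
print(round(r2, 3))
print(best)
```

A higher R-squared indicates a relationship closer to linear, but it is still worth looking at the plots, since R-squared alone can hide curvature or outliers.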