Question 1.

Compare the classification performance of linear regression and k-nearest neighbor classification on the zipcode data. In particular, consider only the 2's and 3's, and k = 1, 3, 5, 7 and 15. Show both the training and test error for each choice.

Reading the train and test files for independent and dependent variables

#Train file as input: read the training data and keep only the digits 2 and 3
X <- as.matrix(read.table(gzfile("zip.train.gz")))
y2or3 <- which(X[, 1] == 2 | X[, 1] == 3)
X.train <- X[y2or3, -1]        # the 256 pixel predictors
y.train <- X[y2or3, 1] == 3    # TRUE when the digit is a 3

#Test file as input: the same filtering for the test data
X <- as.matrix(read.table(gzfile("zip.test.gz")))
y2or3 <- which(X[, 1] == 2 | X[, 1] == 3)
X.test <- X[y2or3, -1]
y.test <- X[y2or3, 1] == 3

Classification by linear regression

## [1] 0.04120879
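
One way to obtain a test error of this kind is to regress the 0/1 class indicator on the 256 pixel values and classify a test image as a "3" when the fitted value exceeds 0.5; a minimal sketch, using the X.train, y.train, X.test and y.test objects created above:

train.df <- data.frame(y = as.numeric(y.train), X.train)
fit.lm <- lm(y ~ ., data = train.df)

# Training error: threshold the fitted values at 0.5
mean((fitted(fit.lm) >= 0.5) != y.train)

# Test error: the columns of X.test carry the same names as X.train
pred.test <- predict(fit.lm, newdata = data.frame(X.test)) >= 0.5
mean(pred.test != y.test)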

Classification by k-nearest neighbors

##                   Test Error Rate
## Linear Regression      0.04120879
## k-NN with k = 1        0.02472527
## k-NN with k = 3        0.03021978
## k-NN with k = 5        0.03021978
## k-NN with k = 7        0.03296703
## k-NN with k = 15       0.03846154
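
A sketch of how the k-NN errors could be computed with knn() from the class package, which classifies each point by majority vote among its k nearest training points (this also produces the training errors requested in the question):

library(class)  # provides knn()

ks <- c(1, 3, 5, 7, 15)
test.err <- sapply(ks, function(k) {
  pred <- knn(train = X.train, test = X.test, cl = y.train, k = k)
  mean(pred != y.test)     # test misclassification rate
})
train.err <- sapply(ks, function(k) {
  pred <- knn(train = X.train, test = X.train, cl = y.train, k = k)
  mean(pred != y.train)    # training misclassification rate
})
data.frame(k = ks, train.error = train.err, test.error = test.err)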

Comparing the results using plots
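
A sketch of the comparison plot, assuming the ks, train.err and test.err vectors from the k-NN sketch above:

# Training and test error against k, with the linear regression test
# error as a dashed horizontal reference line
plot(ks, test.err, type = "b", col = "red", xlab = "k", ylab = "Error rate",
     ylim = range(c(train.err, test.err, 0.04120879)))
lines(ks, train.err, type = "b", col = "blue")
abline(h = 0.04120879, lty = 2)
legend("bottomright", legend = c("k-NN test", "k-NN train", "Linear regression test"),
       col = c("red", "blue", "black"), lty = c(1, 1, 2))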

Comments:

  1. In this case, k-NN with small k values outperforms linear regression.
  2. Among the k-NN procedures, smaller k gives better performance here; this is related to the curse-of-dimensionality problem.
  3. With 256 features, the data points are spread out so far that their ‘nearest neighbors’ are often not actually very near them, so averaging over many neighbors (large k) pulls in points that are quite different from the query point.

Question 2. SELF-STUDY

Question 3.

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

Advantages of a very flexible approach are as follows:
  1. It can give a better fit for non-linear relationships.
  2. It decreases the bias.
Disadvantages of a very flexible approach are as follows:
  1. It requires estimating a greater number of parameters.
  2. It can follow the noise too closely (overfitting).
  3. It increases the variance.

A more flexible approach would be preferred to a less flexible approach when we are interested in prediction rather than the interpretability of the results (i.e. the goal is prediction rather than inference).

A less flexible approach would be preferred to a more flexible approach when we are interested in inference and the interpretability of the results.

Question 4.

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

Obs.  X1  X2  X3  Y
   1   0   3   0  Red
   2   2   0   0  Red
   3   0   1   3  Red
   4   0   1   2  Green
   5  -1   0   1  Green
   6   1   1   1  Red

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

(4.a.) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

##   X1 X2 X3 Y
## 1  0  3  0 R
## 2  2  0  0 R
## 3  0  1  3 R
## 4  0  1  2 G
## 5 -1  0  1 G
## 6  1  1  1 R
##   X1 X2 X3 Y Euclidean_Distance
## a  0  3  0 R           3.000000
## b  2  0  0 R           2.000000
## c  0  1  3 R           3.162278
## d  0  1  2 G           2.236068
## e -1  0  1 G           1.414214
## f  1  1  1 R           1.732051
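
A sketch of how the distances in the table above could be computed, using a small data frame holding the six training observations:

# The six observations from the table in the question
df <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
                 X2 = c(3, 0, 1, 1, 0, 1),
                 X3 = c(0, 0, 3, 2, 1, 1),
                 Y  = c("R", "R", "R", "G", "G", "R"))

# Euclidean distance from each observation to the test point (0, 0, 0)
df$Euclidean_Distance <- sqrt(df$X1^2 + df$X2^2 + df$X3^2)
df
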
(4.b.) What is our prediction with K = 1? Why?
## [1] G
## Levels: G R

Answer:

Since the test point is X1 = X2 = X3 = 0, the nearest observation is Obs. 5 (Green), at a distance of sqrt(2) ≈ 1.414. Thus, with K = 1 the prediction is Green.

(4.c.) What is our prediction with K = 3? Why?
## [1] R
## Levels: G R

Answer:

Since the test point is X1 = X2 = X3 = 0, the three closest observations are Obs. 5 (Green, 1.414), Obs. 6 (Red, 1.732) and Obs. 2 (Red, 2.000). The majority class among these three is Red, so the prediction is Red.

(4.d.) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?
Answer:

If the Bayes decision boundary is highly non-linear, a small K gives a flexible fit that can follow that boundary, whereas a large K averages over many neighbors and produces a smoother, more rigid boundary. Hence, in this case we would expect the best value of K to be small.

Question 5.

This exercise involves the Boston housing data set.
(5.a.) To begin, load in the Boston data set. The Boston data set is part of the MASS library in R. How many rows are in this data set? How many columns? What do the rows and columns represent?
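
A minimal sketch of the commands that could produce the output below:

library(MASS)   # the Boston data set ships with the MASS package

dim(Boston)     # number of rows and columns
summary(Boston) # summary statistics for every column
names(Boston)   # the 14 variable names
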
## [1] 506  14
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Answer:

There are 506 rows and 14 columns. The rows represent the observations, i.e. a total of 506 suburbs (census tracts) of Boston. The columns represent the 14 attributes recorded for each suburb.

(5.b.) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
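
A sketch of the plotting calls behind the observations listed below (the full scatterplot matrix, plus a smaller one focused on medv):

pairs(Boston)                                              # all 14 variables
pairs(Boston[, c("medv", "lstat", "rm", "chas", "dis")])   # medv against selected predictors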

Answer:

Drilling into some scatter plots, we can observe the following:
  1. The pairwise plots compare the median value (medv) against other features in the data frame.
  2. There is a high correlation between medv and lstat (lower status of the population).
  3. There is a high correlation between medv and rm (average number of rooms per dwelling).
  4. lstat is negatively correlated with medv.
  5. rm is positively correlated with medv, which would be expected as higher-value areas would typically have larger dwellings.
  6. Lower correlations with median value are seen for chas (Charles River dummy variable) and dis (distance to employment centres).
(5.c.) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

The predictors with the highest correlation with per capita crime rate are identified below.
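
A sketch of how these correlations could be ranked:

# Correlation of every variable with per capita crime rate, sorted by magnitude
crim.cor <- cor(Boston)[, "crim"]
sort(abs(crim.cor[names(crim.cor) != "crim"]), decreasing = TRUE)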

Answer:

From the pairwise plots showing the correlation with per capita crime rate (crim), the features with the highest correlation with crim, in order, are:
  1. rad (index of accessibility to radial highways): there appears to be a positive correlation with crime rate.

  2. tax (full-value property-tax rate per $10,000): there appears to be a positive correlation with crime rate.

  3. lstat (lower status of the population, percent): there appears to be a positive correlation with crime rate.

In all cases, many areas have low crime (per capita crime rate < 20) regardless of the accessibility to highways, tax rate or lower status of the population; however, there are a few outlying suburbs with very high crime rates.
(5.d.) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

We can answer this question by performing an outlier analysis for each of these attributes (data points lying more than 1.5 times the inter-quartile range beyond the quartiles are treated as outliers here).
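
A sketch of the boxplots used for this analysis (boxplot() draws its whiskers at 1.5 times the inter-quartile range by default, so points beyond the whiskers are the outliers referred to here):

par(mfrow = c(1, 3))
boxplot(Boston$crim,    main = "Per capita crime rate (crim)")
boxplot(Boston$tax,     main = "Property-tax rate (tax)")
boxplot(Boston$ptratio, main = "Pupil-teacher ratio (ptratio)")

sum(Boston$crim > 10)   # number of towns with a crime rate above 10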

Answer:

1. Crime Rate: the per capita crime rate by town. This variable has many outliers, as seen in the boxplot. The majority of towns have a crime rate close to zero (under 5), while some suburbs have a rate as high as 70 or 80. The many high outliers indicate that most suburbs have a very low crime rate, but a significant minority of towns have an outlying crime rate on the higher end (e.g. 54 towns have crim > 10).

2. Tax Rate: the full-value property-tax rate per $10,000. The property tax ranges from approximately 200 to 700, with no extreme outliers. The median is around 330.

3. Pupil-Teacher Ratios: the pupil-teacher ratio by town. The pupil-teacher ratio ranges from 12.6 to approximately 22. The inter-quartile range lies between 17.4 and 20.2, and there are a couple of mild outliers near the low end, around 12.6.

(5.e.) How many of the suburbs in this data set bound the Charles river?

The variable chas provides information about the Charles River (chas= 1 if tract bounds river; chas=0 otherwise).
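
The tabulation below can be obtained with table(), for example:

table(Boston$chas)   # 0 = does not bound the river, 1 = bounds the river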

## 
##   0   1 
## 471  35

Answer: In our case, 35 suburbs bound the Charles River.

(5.f.) What is the median pupil-teacher ratio among the towns in this data set?
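
A sketch of the corresponding commands:

summary(Boston$ptratio)   # five-number summary plus the mean
median(Boston$ptratio)    # the median pupil-teacher ratio
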
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00
## [1] 19.05

Answer: The median pupil-teacher ratio among the towns in this data set is 19.05.

(5.g.) Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

There are two suburbs with the lowest median value of owner-occupied homes. These are shown below, along with the values of the other predictors for these suburbs.
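
A sketch of how these suburbs could be identified (their row numbers appear at the end of the output):

lowest <- which(Boston$medv == min(Boston$medv))   # rows with the minimum medv
Boston[lowest, ]                                   # predictor values for those suburbs
lowest                                             # their row indices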

##       crim zn indus chas   nox    rm age    dis rad tax ptratio  black
## 1: 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.90
## 2: 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 384.97
##    lstat medv
## 1: 30.59    5
## 2: 22.98    5
## [1] 399 406

The table below compares the suburbs with the lowest median value of owner-occupied homes to summary statistics for the wider population of all towns (rows 1 and 2 are the two suburbs; rows 3 to 8 are the minimum, maximum, first quartile, third quartile, mean and median across all towns). Their crime rates are clearly at the very upper end for the city, although they are not the suburbs with the highest crime.

##         crim        zn    indus    chas      nox       rm      age
## 1: 38.351800   0.00000 18.10000 0.00000 0.693000 5.453000 100.0000
## 2: 67.920800   0.00000 18.10000 0.00000 0.693000 5.683000 100.0000
## 3:  0.006320   0.00000  0.46000 0.00000 0.385000 3.561000   2.9000
## 4: 88.976200 100.00000 27.74000 1.00000 0.871000 8.780000 100.0000
## 5:  0.082045   0.00000  5.19000 0.00000 0.449000 5.885500  45.0250
## 6:  3.677082  12.50000 18.10000 0.00000 0.624000 6.623500  94.0750
## 7:  3.613524  11.36364 11.13678 0.06917 0.554695 6.284634  68.5749
## 8:  0.256510   0.00000  9.69000 0.00000 0.538000 6.208500  77.5000
##          dis       rad      tax  ptratio    black    lstat     medv
## 1:  1.489600 24.000000 666.0000 20.20000 396.9000 30.59000  5.00000
## 2:  1.425400 24.000000 666.0000 20.20000 384.9700 22.98000  5.00000
## 3:  1.129600  1.000000 187.0000 12.60000   0.3200  1.73000  5.00000
## 4: 12.126500 24.000000 711.0000 22.00000 396.9000 37.97000 50.00000
## 5:  2.100175  4.000000 279.0000 17.40000 375.3775  6.95000 17.02500
## 6:  5.188425 24.000000 666.0000 20.20000 396.2250 16.95500 25.00000
## 7:  3.795043  9.549407 408.2372 18.45553 356.6740 12.65306 22.53281
## 8:  3.207450  5.000000 330.0000 19.05000 391.4400 11.36000 21.20000

Comments:

  1. The proportion of residential land zoned for lots over 25,000 sq.ft. is 0, which is the minimum for the city, indicating that there is little investment in these suburbs.
  2. The proportion of non-retail business acres per town lies within the third quartile for both suburbs. This indicates there are viable businesses and potentially employment in the area.
  3. The Charles River dummy variable indicates that the suburbs do not lie on the river side.
  4. The nitrogen oxides concentration (parts per 10 million) is in the upper quartile for the city, perhaps because the suburbs are so close to the highways.
  5. The average number of rooms per dwelling is in the lower quartile, indicating smaller apartments or houses, although it is not at the minimum. Perhaps the areas closer to the city have smaller apartments instead of houses.
  6. The proportion of owner-occupied units built prior to 1940 (age) is at the maximum, indicating old housing units and no new builds.
  7. The weighted mean of distances to five Boston employment centres is well into the lower quartile and close to the minimum. This indicates an area of high unemployment.
  8. The index of accessibility to radial highways is at the maximum, indicating that the areas lie on or very near a highway.
  9. The full-value property-tax rate per $10,000 is quite high and in the upper quartile. Perhaps this is because tax and unit area are not linearly related in this city, so larger units actually pay less per square foot.
  10. The pupil-teacher ratio in these areas is in the upper quartile, suggesting some relative under-investment in schooling.
  11. The variable black is around the median of all suburbs, indicating the population is not predominantly black compared to the rest of the city.
  12. The lower status of the population (percent) is close to the maximum and high in the upper quartile.
(5.h.) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

The numbers of suburbs that average more than seven rooms and more than eight rooms per dwelling are given below:
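
A sketch of how the counts could be computed, and of how the 13 suburbs could be inspected for the comments that follow:

c(More_than_7_rooms = sum(Boston$rm > 7),
  More_than_8_rooms = sum(Boston$rm > 8))

summary(Boston[Boston$rm > 8, ])   # the suburbs averaging more than eight rooms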

##      More_than_7_rooms More_than_8_rooms
## [1,]                64                13

Comments:

On analysing the data set, it is evident that the 13 suburbs averaging more than eight rooms per dwelling have a low crime rate, generally a high median value of the dwellings (with one exception below the average) and generally a low pupil-teacher ratio. The full-value property-tax rate per $10,000 is low, indicating that property tax is not judged off house value. In general there are not many non-retail businesses, indicating that they are purely residential areas. The houses are generally older, with some exceptions that have a very low proportion of old buildings. They seem to be further away from the highways than average, again indicating purely residential areas, perhaps on the outskirts. Other variables not mentioned above seem to be broadly consistent with the wider population.

Question 6.

This question involves the use of multiple linear regression on the Auto data set.

(6.a.) Produce a scatterplot matrix which includes all of the variables in the data set.
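
A sketch, assuming the data are read from the textbook's Auto.data file with missing values coded as "?" (which would explain the NAs for horsepower in the correlation matrix below and the 5 observations deleted in the later regression output):

Auto <- read.table("Auto.data", header = TRUE, na.strings = "?")
Auto$name <- as.factor(Auto$name)

pairs(Auto[, 1:8])   # scatterplot matrix of the quantitative variables (name excluded)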

(6.b.) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

The correlations between the variables are given below:

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7762599   -0.8044430         NA -0.8317389
## cylinders    -0.7762599  1.0000000    0.9509199         NA  0.8970169
## displacement -0.8044430  0.9509199    1.0000000         NA  0.9331044
## horsepower           NA         NA           NA          1         NA
## weight       -0.8317389  0.8970169    0.9331044         NA  1.0000000
## acceleration  0.4222974 -0.5040606   -0.5441618         NA -0.4195023
## year          0.5814695 -0.3467172   -0.3698041         NA -0.3079004
## origin        0.5636979 -0.5649716   -0.6106643         NA -0.5812652
##              acceleration       year     origin
## mpg             0.4222974  0.5814695  0.5636979
## cylinders      -0.5040606 -0.3467172 -0.5649716
## displacement   -0.5441618 -0.3698041 -0.6106643
## horsepower             NA         NA         NA
## weight         -0.4195023 -0.3079004 -0.5812652
## acceleration    1.0000000  0.2829009  0.2100836
## year            0.2829009  1.0000000  0.1843141
## origin          0.2100836  0.1843141  1.0000000
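
The NAs in the horsepower row and column come from the missing values in that variable; a sketch of how a complete correlation matrix could be obtained:

cor(Auto[, names(Auto) != "name"], use = "complete.obs")
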
(6.c.) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors.
(6.c.1) Is there a relationship between the predictors and the response?
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Answer: This can be assessed by testing the null hypothesis that there is no relationship between the predictors and the response. The p-value corresponding to the F-statistic is less than 2.2e-16, which is very low; this indicates that there is a relationship between mpg and the other predictors, i.e. we reject the null hypothesis.

(6.c.2) Which predictors appear to have a statistically significant relationship to the response?
Answer:

We can answer this question by checking the p-values associated with each predictor’s t-statistic. We may conclude that ‘displacement’, ‘weight’, ‘year’ and ‘origin’ are statistically significant.

(6.c.3) What does the coefficient for the year variable suggest?
Answer:

The coefficient for the ‘year’ variable is 0.750773. This suggests that the average effect of an increase of one year is an increase of 0.750773 in mpg, holding all other predictors constant.

(6.d.) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

Answer:

  1. The plot of residuals versus fitted values indicates the presence of mild non-linearity in the data.
  2. The plot of standardized residuals versus leverage shows that there are a few outliers with standardized residuals above 2 or below -2.
  3. From the leverage plot, point 14 appears to have unusually high leverage.
(6.e.) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement * 
##     weight, data = Auto[, 1:8])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3564  -2.4882  -0.3635   1.8469  17.8176 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.285e+01  2.233e+00  23.673  < 2e-16 ***
## cylinders               7.580e-01  7.645e-01   0.992    0.322    
## displacement           -7.514e-02  1.669e-02  -4.502 8.90e-06 ***
## weight                 -9.931e-03  1.323e-03  -7.505 4.19e-13 ***
## cylinders:displacement -2.893e-03  3.424e-03  -0.845    0.399    
## displacement:weight     2.147e-05  4.996e-06   4.298 2.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.115 on 391 degrees of freedom
## Multiple R-squared:  0.7269, Adjusted R-squared:  0.7234 
## F-statistic: 208.2 on 5 and 391 DF,  p-value: < 2.2e-16

Answer: The p-values indicate that the interaction between displacement and weight is statistically significant, while the interaction between cylinders and displacement is not.

(6.f.) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.

Answer: The log transformation gives the most linear-looking plot for weight.

Answer: The square root transformation gives a more linear-looking plot for displacement.

Answer: All three plots seem to be very scattered around the y-axis for acceleration.

Answer: The log transformation gives the most linear-looking plot for horsepower.
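
A sketch of the kind of comparison described above, shown here for horsepower (the same pattern can be applied to weight, displacement and acceleration):

par(mfrow = c(1, 3))
plot(Auto$horsepower,       Auto$mpg, main = "horsepower")
plot(log(Auto$horsepower),  Auto$mpg, main = "log(horsepower)")
plot(sqrt(Auto$horsepower), Auto$mpg, main = "sqrt(horsepower)")

# R-squared of simple fits with and without the log transformation
summary(lm(mpg ~ horsepower, data = Auto))$r.squared
summary(lm(mpg ~ log(horsepower), data = Auto))$r.squared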