1. In a short paragraph (3-5 sentences), identify one problem or challenge that could be addressed, at least partially, through:

a. Predictive modeling

b. Inference

c. Clustering (unsupervised learning)

A problem that could be addressed via predictive modeling, inference, and clustering is UCSD waitlist prediction. Whether one will get off of a waitlist is a constant source of uncertainty for students at UCSD and other universities. From previous enrollment data, it would be possible to infer the relationship between the chance of getting in and predictors such as waitlist position, the class waitlisted for, the current day of the enrollment period, and the speed with which the class filled up. Some of these predictors, such as which class or which department the class is in, would likely benefit from clustering. The resulting model could then be used to predict a student's future chance of getting in, and take some weight off of students’ minds!

2. ISLR problem 2.1

For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

(a) The sample size n is extremely large, and the number of predictors p is small.

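A flexible method would generally perform better. With a very large n and few predictors, there is enough data for a flexible model to fit the true relationship with low bias, and the large sample keeps it from overfitting.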

(b) The number of predictors p is extremely large, and the number of observations n is small.

An inflexible method would generally perform better. With many predictors and few observations, a flexible model is very likely to overfit, while an inflexible model makes more stable guesses from limited information.

(c) The relationship between the predictors and response is highly non-linear.

A flexible method would generally perform better, as an inflexible model is unlikely to capture the nuance of the relationship between the predictors and response.

(d) The variance of the error terms, i.e. σ² = Var(ε), is extremely high.

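An inflexible method would generally perform better. When the noise is large, a flexible model tends to fit the noise in the training data rather than the underlying relationship, so it overfits.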
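The answers to parts (a) through (d) all lean on the bias-variance trade-off. As a reminder, the expected test error at a point x_0 can be decomposed as follows (this is the standard decomposition from ISLR Chapter 2; the symbols \hat{f}, x_0, and y_0 are that chapter's notation rather than anything defined in this write-up):

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\varepsilon)
```

Flexible methods tend to lower the bias term and raise the variance term. The last term, Var(ε), is irreducible, which is why extra flexibility does not help in part (d).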

3. ISLR problem 2.7

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

Obs.  X1  X2  X3  Y
1      0   3   0  Red
2      2   0   0  Red
3      0   1   3  Red
4      0   1   2  Green
5     −1   0   1  Green
6      1   1   1  Red

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

# Each observation as a vector (X1, X2, X3), plus the class labels.
obs.1 = c(0,3,0)
obs.2 = c(2,0,0)
obs.3 = c(0,1,3)
obs.4 = c(0,1,2)
obs.5 = c(-1,0,1)
obs.6 = c(1,1,1)
obs.colors = c("Red", "Red", "Red", "Green", "Green", "Red")

# Named vector to hold the distance from each observation to the test point.
distances = c(0,0,0,0,0,0)
names(distances) = c("Obs 1", "Obs 2", "Obs 3", "Obs 4", "Obs 5", "Obs 6")

# The test point is the origin, so each distance is just the vector's Euclidean norm.
distances[1] = sqrt(obs.1[1]^2 + obs.1[2]^2 + obs.1[3]^2)
distances[2] = sqrt(obs.2[1]^2 + obs.2[2]^2 + obs.2[3]^2)
distances[3] = sqrt(obs.3[1]^2 + obs.3[2]^2 + obs.3[3]^2)
distances[4] = sqrt(obs.4[1]^2 + obs.4[2]^2 + obs.4[3]^2)
distances[5] = sqrt(obs.5[1]^2 + obs.5[2]^2 + obs.5[3]^2)
distances[6] = sqrt(obs.6[1]^2 + obs.6[2]^2 + obs.6[3]^2)

print(distances)
##    Obs 1    Obs 2    Obs 3    Obs 4    Obs 5    Obs 6 
## 3.000000 2.000000 3.162278 2.236068 1.414214 1.732051
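
The same distances can also be computed in one step by stacking the observations into a matrix. This is only an alternative sketch, not the code used to produce the output above; the names obs, test.point, diffs, and dist.to.test are introduced here for illustration.

```r
# Rows are the six training observations; columns are X1, X2, X3.
obs <- rbind(c(0, 3, 0),
             c(2, 0, 0),
             c(0, 1, 3),
             c(0, 1, 2),
             c(-1, 0, 1),
             c(1, 1, 1))
rownames(obs) <- paste("Obs", 1:6)

test.point <- c(0, 0, 0)

# Subtract the test point from every row, then take the row-wise Euclidean norm.
diffs <- sweep(obs, 2, test.point)
dist.to.test <- sqrt(rowSums(diffs^2))
print(round(dist.to.test, 3))
```

Because the test point is passed in explicitly, the same lines would work for a test point other than the origin.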

(b) What is our prediction with K = 1? Why?

The nearest observation is Obs. 5 (distance ≈ 1.41), which is green. With K = 1 the prediction is simply the class of the single nearest neighbor, so we expect the test point to be green.

(c) What is our prediction with K = 3? Why?

The nearest 3 observations are 5, 6, and 2 (distances ≈ 1.41, 1.73, and 2.00), which are green, red, and red respectively. Thus we expect the test point to be red, since two of the three nearest points are red.
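
The K = 1 and K = 3 predictions can be checked with a small majority-vote function. This is a minimal sketch that assumes the distances from part (a); knn.vote is a hypothetical helper written for this check, not a function from any package.

```r
# Distances from part (a) and the class labels from the table.
dists <- c(3, 2, sqrt(10), sqrt(5), sqrt(2), sqrt(3))
labels <- c("Red", "Red", "Red", "Green", "Green", "Red")

# knn.vote (hypothetical helper): majority class among the k nearest points.
# If classes tie, which.max returns the first one in alphabetical order.
knn.vote <- function(dists, labels, k) {
  nearest <- order(dists)[seq_len(k)]
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

print(knn.vote(dists, labels, k = 1))
print(knn.vote(dists, labels, k = 3))
```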

(d) If the Bayes decision boundary in this problem is highly non-linear, then would we expect the best value for K to be large or small? Why?

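We would expect the best value for K to be small. A small K gives a more flexible decision boundary, which is needed to follow a highly non-linear Bayes decision boundary, while a large K averages over many neighbors and produces a smoother, closer-to-linear boundary.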

4. Applied exercise: Download the data set Income2.csv from the textbook’s website (http://www-bcf.usc.edu/~gareth/ISL/data.html). Load this data set into your favorite data analysis software environment (MATLAB, Python or R). In MATLAB, you could use the commands readtable or csvread. NOTE: Please include your code.

a. Make a scatter plot showing years of education on the x-axis vs. income (in thousands of dollars) on the y-axis. Make sure to label the x and y axes (in MATLAB, use the functions xlabel and ylabel).

b. Calculate the mean income level for this data set

c. Calculate the standard deviation of the income level

d. Calculate the standard error of the mean (SEM)

e. Create a new categorical variable called HigherEd. This variable is defined to be 1 if the subject has ≥16 years of education, and 0 otherwise. Make a box plot comparing the income level of subjects with HigherEd=0 vs. HigherEd=1.

# Load the data; adjust the path to wherever Income2.csv is saved.
income.data <- read.csv("/Users/rdoctor/Desktop/Cogs 109 hw1/Income2.csv")

# (a) Scatter plot of income against years of education.
plot(income.data[["Education"]], income.data[["Income"]], xlab = "Years of education", ylab = "Income (thousands of dollars)")

# (b) Mean income.
income.mean <- mean(income.data[["Income"]])
print(income.mean)
## [1] 62.74473
# (c) Standard deviation of income.
income.sd <- sd(income.data[["Income"]])
print(income.sd)
## [1] 27.01328
# (d) Standard error of the mean: sd / sqrt(n).
income.sem <- income.sd / sqrt(nrow(income.data))
print(income.sem)
## [1] 4.931929
# (e) HigherEd indicator: TRUE if the subject has at least 16 years of education.
HigherEd = income.data[["Education"]] >= 16
print(HigherEd)
##  [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
## [12]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [23]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE
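
Part (e) also asks for HigherEd to be coded as 1/0 and for a box plot, which the listing above stops short of. One way that step might look is sketched below. To keep the example runnable on its own it uses a small made-up data frame called toy (these values are NOT from Income2.csv); with the real data, income.data would take its place.

```r
# Toy stand-in for income.data; these values are made up, NOT from Income2.csv.
toy <- data.frame(Education = c(10, 12, 14, 16, 18, 20),
                  Income    = c(25, 30, 45, 60, 75, 90))

# Code HigherEd as 1 when Education >= 16 and 0 otherwise.
toy$HigherEd <- as.integer(toy$Education >= 16)

# Box plot of income for each HigherEd group.
boxplot(Income ~ HigherEd, data = toy,
        xlab = "HigherEd (1 = 16+ years of education)",
        ylab = "Income (thousands of dollars)")
```

Using as.integer turns the logical TRUE/FALSE comparison into the 1/0 coding the question asks for.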