Answer :
- Since the response in this case is quantitative (CEO salary), this is a
Regression problem.
- We are most interested in identifying the factors that influence CEO
salary rather than in predicting the salary itself, hence our goal is
Inference.
- n = 500
p = 3 (profit, number of employees, industry).
Answer :
- Since the response in this case is binary,
i.e. success/failure, this is a
Classification problem.
- We are most interested in predicting whether the launch of the product
will be a success or a failure, hence our goal is
Prediction.
- n = 20
p = 13 (price charged, marketing budget, competition price, and 10 other
variables).
Answer :
- Since the response in this case is quantitative, i.e. the % change
in the USD/Euro exchange rate, this is a Regression
problem.
- We are most interested in predicting the % change in the
USD/Euro exchange rate, hence our goal is
Prediction.
- n = 52 (weeks in year 2012)
p = 3 (% change in the US market, % change
in the British market, % change in the German market).
Answer :
knitr::include_graphics("graph.png")
Answer :
- Squared bias. This is the error introduced by the difference between
our approximation and the true underlying function. A more flexible
model can follow the true function increasingly closely, so the squared
bias diminishes as the flexibility increases.
- Variance. In the limit of a model with no flexibility the variance
will be zero, since the model fit will be independent of the data. As
the flexibility increases the variance will increase as well, since the
noise in a particular training set will increasingly be captured by the
model. The curve described by the variance is a monotonically
increasing function of the flexibility of the model.
- Training error. The training error is given by the average (squared)
difference between the predictions of the model and the observations. If
a model is very inflexible this can be quite high, but as the
flexibility increases this difference will decrease. If we consider
polynomials, for example, increasing the flexibility of the model might
mean increasing the degree of the polynomial to be fitted. The
additional degrees of freedom will decrease the average difference and
reduce the training error.
- Test error. The expected test error is given by the formula: Variance
+ Squared bias + Bayes error, all of which are non-negative. The Bayes
error is constant and a lower bound for the test error. The test error
has a minimum at an intermediate level of flexibility: not too flexible,
so that the variance does not dominate, and not too inflexible, so that
the squared bias is not too high. The plot of the test error thus
resembles a (deformed) U shape: high for inflexible models, decreasing
as flexibility increases until it reaches a minimum. Then the variance
starts to dominate and the test error starts increasing. The distance
between this minimum and the Bayes irreducible error gives us an idea of
how well the best function in the hypothesis space will fit.
- Bayes error. This term is constant: by definition it does not depend
on X, and therefore it does not depend on the flexibility of the model.
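The behaviour of the training error described above can be checked with a small simulation. This is only a sketch under stated assumptions: the "true" function `f`, the noise level, the sample size and the seed are all invented for illustration and do not come from the exercise. It fits polynomials of increasing degree and records the training and test mean squared error for each.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

# Assumed (hypothetical) true function and noise level
f <- function(x) sin(2 * x)
n <- 100
x_train <- runif(n, -2, 2)
y_train <- f(x_train) + rnorm(n, sd = 0.3)
x_test <- runif(n, -2, 2)
y_test <- f(x_test) + rnorm(n, sd = 0.3)

train_df <- data.frame(x = x_train, y = y_train)
test_df <- data.frame(x = x_test)

# Flexibility is measured here by the polynomial degree
degrees <- 1:10
train_mse <- numeric(length(degrees))
test_mse <- numeric(length(degrees))

for (i in seq_along(degrees)) {
  fit <- lm(y ~ poly(x, degrees[i]), data = train_df)
  # Average squared residual on the data used for fitting
  train_mse[i] <- mean(residuals(fit)^2)
  # Average squared prediction error on fresh data
  test_mse[i] <- mean((y_test - predict(fit, newdata = test_df))^2)
}

data.frame(degree = degrees, train_mse = train_mse, test_mse = test_mse)
```

Because each polynomial model contains the lower-degree ones, the least-squares training error cannot increase with the degree. The test error carries no such guarantee, and its shape in any single run depends on the assumed function and noise.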
Answer :
knitr::include_graphics("q7.png")
Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors. (a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
dat <- data.frame(
"x1" = c(0, 2, 0, 0, -1, 1),
"x2" = c(3, 0, 1, 1, 0, 1),
"x3" = c(0, 0, 3, 2, 1, 1),
"y" = c("Red", "Red", "Red", "Green", "Green", "Red")
)
# Euclidean distance between points and c(0, 0, 0)
dist <- sqrt(dat[["x1"]]^2 + dat[["x2"]]^2 + dat[["x3"]]^2)
dat[["distance"]] <- c(signif(dist, 3))
dat
## x1 x2 x3 y distance
## 1 0 3 0 Red 3.00
## 2 2 0 0 Red 2.00
## 3 0 1 3 Red 3.16
## 4 0 1 2 Green 2.24
## 5 -1 0 1 Green 1.41
## 6 1 1 1 Red 1.73
knn <- function(k) {
names(which.max(table(dat[["y"]][order(dist)[1:k]])))
}
knn(1)
## [1] "Green"
knn(3)
## [1] "Red"
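With K = 1 the prediction is Green (nearest neighbor: observation 5). With K = 3 the prediction is Red (nearest neighbors: observations 5, 6 and 2, i.e. Green, Red, Red).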
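For a test point other than the origin, the same calculation can be wrapped in a small function. The sketch below is illustrative only: `knn_vote` and `train` are hypothetical names not used in the original answer, and ties in distance or in votes are resolved simply by whatever `order()` and `which.max()` return first.

```r
# Same six observations as in the exercise
train <- data.frame(
  x1 = c(0, 2, 0, 0, -1, 1),
  x2 = c(3, 0, 1, 1, 0, 1),
  x3 = c(0, 0, 3, 2, 1, 1),
  y = c("Red", "Red", "Red", "Green", "Green", "Red")
)

# Hypothetical helper: majority vote among the k nearest observations
knn_vote <- function(train, query, k) {
  X <- as.matrix(train[, c("x1", "x2", "x3")])
  # Subtract the query point from every row, then take row-wise Euclidean norms
  d <- sqrt(rowSums(sweep(X, 2, query)^2))
  nearest <- order(d)[seq_len(k)]
  votes <- table(train$y[nearest])
  list(neighbors = nearest, prediction = names(which.max(votes)))
}

res1 <- knn_vote(train, c(0, 0, 0), k = 1)
res3 <- knn_vote(train, c(0, 0, 0), k = 3)
res1$prediction
res3$prediction
```

Returning the neighbor indices alongside the label makes it easy to see which observations drove each vote.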