For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
Before answering the questions, we should note that an inflexible method is a simpler model, while a flexible method is a more complex one that can fit a wider variety of shapes.
a. The sample size n is extremely large, and the number of predictors p is small.
b. The number of predictors p is extremely large, and the number of observations n is small.
c. The relationship between the predictors and response is highly non-linear.
d. The variance of the error terms, i.e. \(\sigma^2 = Var(\epsilon)\), is extremely high.
Describe the difference between a parametric and non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
Parametric statistical approaches rely on assumptions about the shape of the underlying population's distribution (e.g., a normal distribution) and about the assumed distribution's form or parameters (e.g., means and standard deviations). Non-parametric statistical approaches make few or no assumptions about the form of the population distribution from which the sample was taken. In other words, the distinction between parametric and non-parametric statistical learning approaches is based on assumptions. The advantage of a parametric approach is that, by assuming a specific form, the estimation problem reduces to fitting a small number of parameters, which requires less data and is easier to interpret; the disadvantage is that if the assumed form is far from the truth, the fitted model can be badly wrong.
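As a concrete illustration (not part of the original answer), here is a minimal R sketch contrasting a parametric fit, which assumes a linear form for the relationship, with a non-parametric smoother that does not; the mtcars data and variable choices are only examples:
# parametric: assume mpg is a linear function of hp and estimate two coefficients
parametric_fit <- lm(mpg ~ hp, data = mtcars)
# non-parametric: loess makes no global assumption about the functional form
nonparametric_fit <- loess(mpg ~ hp, data = mtcars)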
Carefully explain the difference between the KNN classifier and KNN regression methods. Name a downside of using this model on very large data sets.
In the KNN classification algorithm, the user wants to predict a categorical value, which is often represented by an integer; for example, similarly to a dummy variable, we can code male as 0 and female as 1. The KNN classifier examines the k closest neighbors of the input we are attempting to predict and outputs the most common class among those k samples.
In KNN regression, the user wishes to predict a numerical value, such as a rent price or a cryptocurrency price. The algorithm combines the responses of the k nearest neighbors of the point for which we wish to make a prediction into a single value, typically by taking their average (or median), and returns that as the result.
When using a large data set, the prediction stage can be slow, because the distance from the new point to every training observation must be computed. It also requires a lot of memory, since all of the training data must be stored. Given that, KNN can be computationally expensive on very large data.
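As a rough sketch (not from the original answer), both variants can be fit with parsnip's nearest_neighbor(); the kknn engine, k = 5, and the built-in data sets used here are only assumptions for illustration:
library(tidymodels)
# KNN regression: average the responses of the 5 nearest neighbors
nearest_neighbor(neighbors = 5) %>%
  set_mode("regression") %>%
  set_engine("kknn") -> knn_reg_spec
knn_reg_spec %>%
  fit(mpg ~ hp + wt, data = mtcars) -> knn_reg_fit
# KNN classification: take a majority vote among the 5 nearest neighbors
nearest_neighbor(neighbors = 5) %>%
  set_mode("classification") %>%
  set_engine("kknn") -> knn_cls_spec
knn_cls_spec %>%
  fit(Species ~ ., data = iris) -> knn_cls_fit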
Suppose we have a data set with five predictors, \(X_1\) = GPA, \(X_2\) = extracurricular activities (EA), \(X_3\) = Gender (1 for Female and 0 for Male), \(X_4\) = Interaction between GPA and EA, and \(X_5\) = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get \(\beta_0\) = 50, \(\beta_1\) = 20, \(\beta_2\) = 0.07, \(\beta_3\) = 35, \(\beta_4\) = 0.01, \(\beta_5\) = −10.
The regression equation should be: \[ \hat{salary} = 50 + 20GPA + 0.07EA + 35Gender + 0.01GPA*EA -10GPA*Gender \]
As Gender is a dummy variable, the equation for females and males should be:
Female: gender is 1 \[ \hat{salary} = 50 + 20GPA + 0.07EA + 35 + 0.01GPA*EA - 10GPA \] \[ \hat{salary} = 85 + 10GPA + 0.07EA + 0.01GPA*EA \]
Male: gender is 0 \[
\hat{salary} = 50 + 20GPA + 0.07EA + 0.01GPA*EA
\] In other words, the difference between female and male is: \[
female - male = (85 + 10GPA + 0.07EA + 0.01GPA*EA) - (50 + 20GPA + 0.07EA + 0.01GPA*EA)
\] \[
female - male = 35 - 10GPA
\] a. Which answer is correct, and why?
1. For a fixed value of EA and GPA, males earn more on average than females.
2. For a fixed value of EA and GPA, females earn more on average than males.
3. For a fixed value of EA and GPA, males earn more on average than females provided that the GPA is high enough.
4. For a fixed value of EA and GPA, females earn more on average than males provided that the GPA is high enough.

Answer 3 is correct: since female - male = 35 - 10GPA, the difference is negative whenever GPA > 3.5, so for a fixed EA and GPA males earn more on average than females provided that the GPA is high enough.
b. Predict the salary of a female with EA of 110 and GPA of 4.0.
EA <- 110
GPA <- 4
f_salary <- (85 + 10*GPA + 0.07*EA + 0.01*GPA*EA) * 1000 # salary is in thousands of dollars, so multiply by 1000 to get dollars
f_salary
## [1] 137100
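For comparison (not in the original output), the corresponding male salary at the same GPA and EA, and the female - male gap, which matches the 35 - 10GPA expression derived above:
m_salary <- (50 + 20*GPA + 0.07*EA + 0.01*GPA*EA) * 1000 # male: Gender = 0
m_salary # 142100
f_salary - m_salary # (35 - 10*GPA) * 1000 = -5000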
This question should be answered using the biomass data set.
# load the packages used in this analysis
library(tidyverse)
library(tidymodels)
## sample dataset carbon hydrogen oxygen nitrogen sulfur HHV
## 1 Akhrot Shell Training 49.81 5.64 42.94 0.41 0.00 20.008
## 2 Alabama Oak Wood Waste Training 49.50 5.70 41.30 0.20 0.00 19.228
## 3 Alder Training 47.82 5.80 46.25 0.11 0.02 18.299
## 4 Alfalfa Training 45.10 4.97 35.60 3.30 0.16 18.151
## 5 Alfalfa Seed Straw Training 46.76 5.40 40.72 1.00 0.02 18.450
## 6 Alfalfa Stalks Training 45.40 5.75 40.20 2.04 0.10 18.465
Fit a linear model to predict HHV using carbon, hydrogen and oxygen.
# create a parsnip specification
linear_reg() %>%
set_mode("regression") %>%
set_engine("lm") -> lm_spec
# fit the model using tidymodels
lm_spec %>%
fit(HHV ~ carbon + hydrogen + oxygen, data = biomass) -> lm_fit
# Another way to see the summary table
# tidy(lm_fit)
# predict HHV variable by using the `lm_fit` linear model
predict(lm_fit, new_data = biomass)
## # A tibble: 536 x 1
## .pred
## <dbl>
## 1 19.7
## 2 19.6
## 3 19.1
## 4 17.9
## 5 18.6
## 6 18.2
## 7 18.9
## 8 18.3
## 9 19.3
## 10 18.8
## # … with 526 more rows
- carbon has an impact on HHV when holding hydrogen and oxygen constant.
- hydrogen has an impact on HHV when holding carbon and oxygen constant.
- oxygen has no impact on HHV when holding carbon and hydrogen constant.
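The call that produced the summary below is not shown in the original; a minimal sketch, assuming we summarize the underlying lm object stored inside the parsnip fit:
summary(lm_fit$fit)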
## Call:
## stats::lm(formula = HHV ~ carbon + hydrogen + oxygen, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1375 -0.4980 -0.1059 0.3946 10.6878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0456860 0.6256343 1.671 0.0952 .
## carbon 0.3478508 0.0078265 44.445 < 2e-16 ***
## hydrogen 0.2430900 0.0618804 3.928 9.68e-05 ***
## oxygen -0.0003767 0.0082979 -0.045 0.9638
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.45 on 532 degrees of freedom
## Multiple R-squared: 0.8518, Adjusted R-squared: 0.8509
## F-statistic: 1019 on 3 and 532 DF, p-value: < 2.2e-16
Write out the model in equation form. \[ \hat{HHV} = 1.0456860 + 0.3478508carbon + 0.2430900hydrogen - 0.0003767oxygen \]
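As a quick sanity check (not part of the original), plugging the first biomass sample (carbon = 49.81, hydrogen = 5.64, oxygen = 42.94) into this equation reproduces the first predicted value shown earlier:
# predicted HHV for the first sample, computed from the fitted coefficients
1.0456860 + 0.3478508*49.81 + 0.2430900*5.64 - 0.0003767*42.94
# ~ 19.73, matching the first row (19.7) of predict(lm_fit, new_data = biomass)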
For which of the predictors can you reject the null hypothesis \(H_0: \beta_j = 0\)?
Hypothesis test: \(H_0: \beta_j = 0\) vs. \(H_a: \beta_j \neq 0\)
The p-values for \(\beta_1\) (carbon) and \(\beta_2\) (hydrogen) are < 2e-16 and 9.68e-05, respectively, so we have evidence to reject the null hypothesis in favor of the alternative hypothesis, meaning that \(\beta_1\) and \(\beta_2\) are not equal to 0. Thus, the predictors carbon and hydrogen have an impact on HHV when holding the other predictors constant.
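These p-values can also be pulled out programmatically (a sketch building on the tidy() call commented out earlier; the 0.05 cutoff is an assumption):
# coefficients whose p-values fall below the usual 0.05 threshold
tidy(lm_fit) %>%
  filter(p.value < 0.05)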
Based on the results of 5a, we consider removing the predictor that does not meet the significance level. That is, we eliminate the variable oxygen (since it shows no significant impact) and then re-fit the linear model.
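The re-fitting code is not shown in the original; a minimal sketch consistent with the output below (the object name lm_fit2 is an assumption):
# re-fit the model without oxygen, reusing the same parsnip specification
lm_spec %>%
  fit(HHV ~ carbon + hydrogen, data = biomass) -> lm_fit2
summary(lm_fit2$fit)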
##
## Call:
## stats::lm(formula = HHV ~ carbon + hydrogen, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1393 -0.4984 -0.1074 0.3934 10.6733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.02853 0.49815 2.065 0.0394 *
## carbon 0.34805 0.00646 53.882 < 2e-16 ***
## hydrogen 0.24180 0.05491 4.403 1.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.448 on 533 degrees of freedom
## Multiple R-squared: 0.8518, Adjusted R-squared: 0.8512
## F-statistic: 1532 on 2 and 533 DF, p-value: < 2.2e-16
The intercept and oxygen are not significant in the 5a model, whereas all of the predictors and the intercept are significant in the 5e model. Moreover, the adjusted R-squared increases from 0.8509 in 5a to 0.8512 in 5e. As a result, removing the predictor oxygen is a reasonable decision.
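To put the two fits side by side (a sketch, assuming the re-fit object is named lm_fit2 as above):
# compare overall fit statistics of the full and reduced models
bind_rows(
  glance(lm_fit$fit),
  glance(lm_fit2$fit)
) %>%
  select(r.squared, adj.r.squared, sigma)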