1 Exercise 1

For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

Before we answering the questions, we should know a inflexible method is a simple method; a flexible method is a complex method.

The sample size n is extremely large, and the number of predictors p is small.

Better. The large number of observations is better because we can know more about the detailed correlation between them. Also, the number of small predictors and large sample sizes may help to avoid a overfitting problem and reduce the bias.

The number of predictors p is extremely large, and the number of observations n is small.

Worse. It is possible that not all predictors have a significant effect on the response variable. Overfitting and spurious correlation are phenomenons that affect data models with a large number of predictors, in which the data model performs well on training data but badly on test data. Furthermore, small sample sizes and too many predictors will result in high variance.

The relationship between the predictors and response is highly non-linear.

Better. If there are too many limits, just a few types of trends are conceivable, which may not convey the underlying nonlinear relationship. A better fit will result from having more degrees of freedom. Therefore, non-linearity allows predictors to be better fitted with the dependent variable.

The variance of the error terms, is extremely high.

Worse. Because of the high variance, an algorithm may model the random noise training data rather than the expected outputs. High error terms would decrease the model’s performance.

1.1 References:

2 Exercise 2

Describe the difference between a parametric and non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

Parametric statistical approaches rely on assumptions about the underlying population’s distribution shape (i.e., a normal distribution) and the assumed distribution’s form or parameters (i.e., means and standard deviations). Nonparametric statistical approaches make few or no assumptions on the form of the population distribution from which the sample was taken. In other words, the distinction between parametric and non-parametric statistical learning approaches is based on assumptions.

Advantages of parametric statistical learning: Easy to figure out. If the each group is distinct, the parametric method can be highly effective.
Disadvantages of parametric statistical learning: If the data does not meet the assumptions, the parametric method could lead to incorrect conclusions. The assumption part is very important in parametric approaches. If the data is unable to overcome the assumption obstacle, consider transforming the data as a remedy. Also, small sample size (n < 30) is not good for parametric methods.
Advantages of non-parametric statistical learning: A good choice for data with a small sample size. There is no need to pass any assumption tasks ahead of time.
Disadvantages of non-parametric statistical learning: If the data is surely normal, non-parametric have less power for the sample size than parametric methods. Plus, non-parametric processes can likewise be more difficult to interpret than parametric processes.

2.1 References

3 Exercise 3

Carefully explain the the difference between the KNN classifier and KNN regression methods. Name a downside when using this model on very large data.

In KNN classification algorithm, the user wants to compute a categorical value, which is represented by an integer. For example, similarly to a dummy variable, we set 0 as male and 1 as female. The KNN classification algorithm will examine the k closest neighbors of the input we are attempting to predict. The most common result among the k samples will then be output.

In KNN algorithm for regression, the user wishes to output a numerical value, such as rent prices or cryptocurrency price predictions. The KNN algorithm would combine the values associated with the k nearest examples from the one on which we wish to make a prediction into a single value by taking an average or median to be a result.

When using a large data set, the prediction stage might be slow. It also requires high memory because we need to store all of the training data. Given that, it can also be computationally expensive.

3.1 References

4 Exercise 4

Suppose we have a data set with five predictors, \(X1\)= GPA, \(X2\)= extracurricular activities (EA), \(X3\)= Gender (1 for Female and 0 for Male), \(X4\)= Interaction between GPA and EA, and \(X5\)= Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get \(\beta_0\) = 50, \(\beta_1\) = 20, \(\beta_2\) = 0.07, \(\beta_3\) = 35, \(\beta_4\) = 0.01, \(\beta_5\) = −10.

The regression equation should be: \[ \hat{salary} = 50 + 20GPA + 0.07EA + 35Gender + 0.01GPA*EA -10GPA*Gender \]

As Gender is a dummy variable, the equation for females and males should be:

Female: gender is 1 \[ \hat{salary} = 50 + 20GPA + 0.07EA + 35 + 0.01GPA*EA -10GPA*Gender \] \[ \hat{salary} = 85 + 10GPA + 0.07EA + 0.01GPA*EA \]

Male: gender is 0 \[ \hat{salary} = 50 + 20GPA + 0.07EA + 0.01GPA*EA \] In other words, the difference between female and male is: \[ female - male = (85 + 10GPA + 0.07EA + 0.01GPA*EA) - (50 + 20GPA + 0.07EA + 0.01GPA*EA) \] \[ female - male = 35 - 10GPA \] a. Which answer is correct, and why?
1. For a fixed value of EA and GPA, males earn more on average than females.
2. For a fixed value of EA and GPA, females earn more on average than males.
3. For a fixed value of EA and GPA, males earn more on average than females provided that the GPA is high enough.
4. For a fixed value of EA and GPA, females earn more on average than males provided that the GPA is high enough.

3 is correct. Because the expected salary of female subtract male is 35 - 10 * GPA, we can know the female earn more 35 - 10 * mean(GPA) than male. According to the female-male equation, as the average GPA rises, the female will earn less. In conclusion, men will earn more than women if their GPA is higher.

Predict the salary of a female with EA of 110 and a GPA of 4.0.

137,100 dollars.

EA <- 110
GPA <- 4
f_salary <- (85 + 10*GPA + 0.07*EA + 0.01*GPA*EA) * 1000 # in thousand dollars
f_salary

## [1] 137100

True or false: Since the coefficient for the GPA/EA interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

This is false. A small value of coefficient does not imply that the interaction term has a minor influence. The statistical significance of the coefficient can be determined by looking at the p-value in the coefficient table.

5 Exercise 5

This question should be answered using the biomass data set.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──

## ✓ broom     0.7.6      ✓ recipes   0.1.14
## ✓ dials     0.0.9      ✓ rsample   0.0.9 
## ✓ infer     0.5.3      ✓ tune      0.1.1 
## ✓ modeldata 0.1.0      ✓ workflows 0.2.1 
## ✓ parsnip   0.1.5      ✓ yardstick 0.0.7

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()

data("biomass")
biomass %>%
  head()

##                   sample  dataset carbon hydrogen oxygen nitrogen sulfur    HHV
## 1           Akhrot Shell Training  49.81     5.64  42.94     0.41   0.00 20.008
## 2 Alabama Oak Wood Waste Training  49.50     5.70  41.30     0.20   0.00 19.228
## 3                  Alder Training  47.82     5.80  46.25     0.11   0.02 18.299
## 4                Alfalfa Training  45.10     4.97  35.60     3.30   0.16 18.151
## 5     Alfalfa Seed Straw Training  46.76     5.40  40.72     1.00   0.02 18.450
## 6         Alfalfa Stalks Training  45.40     5.75  40.20     2.04   0.10 18.465

Fit a multiple regression model to predict HHV using carbon, hydrogen and oxygen.

# create a parsnip specification
linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm") -> lm_spec

# fit the model using tidymodel
lm_spec %>%
  fit(HHV ~ carbon + hydrogen + oxygen, data = biomass) -> lm_fit

# Another way to see the summary table
# tidy(lm_fit)

# predict HHV variable by using the `lm_fit` linear model
predict(lm_fit, new_data = biomass)

## # A tibble: 536 x 1
##    .pred
##    <dbl>
##  1  19.7
##  2  19.6
##  3  19.1
##  4  17.9
##  5  18.6
##  6  18.2
##  7  18.9
##  8  18.3
##  9  19.3
## 10  18.8
## # … with 526 more rows

Provide an interpretation of each coefficient in the model.

The p-value of intercept is 0.0952, meaning that we fail to reject the null.
The p-value of \(b_1\) is less than 2e-16, meaning that the carbon has an impact on HHV when holding hydrogen and oxygen constant.
The p-value of \(b_2\) is 0.0000986, meaning that the hydrogen has an impact on HHV when holding carbon and oxygen constant.
The p-value of \(b_3\) is 0.9638, which means the result is the same as the \(b_0\). That is, we have no evidence to reject the null hypothesis. So we can conclude the oxygen has no impact on HHV when holding carbon and hydrogen constant.
\(b_0\) is 1.0456860, \(b_1\) is 0.3478508, \(b_2\) is 0.2430900, and \(b_3\) is -0.0003767.

lm_fit %>%
  pluck("fit") %>%
  summary()

## 
## Call:
## stats::lm(formula = HHV ~ carbon + hydrogen + oxygen, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1375  -0.4980  -0.1059   0.3946  10.6878 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.0456860  0.6256343   1.671   0.0952 .  
## carbon       0.3478508  0.0078265  44.445  < 2e-16 ***
## hydrogen     0.2430900  0.0618804   3.928 9.68e-05 ***
## oxygen      -0.0003767  0.0082979  -0.045   0.9638    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.45 on 532 degrees of freedom
## Multiple R-squared:  0.8518, Adjusted R-squared:  0.8509 
## F-statistic:  1019 on 3 and 532 DF,  p-value: < 2.2e-16

Write out the model in equation form. \[ \hat{HHV} = 1.0456860 + 0.3478508carbon + 0.2430900hydrogen - 0.0003767oxygen \]
For which the predictors can you reject the null hypothesis H0:\(\beta_j = 0\)?

Hypothesis test: H0: \(\beta j=0\) vs Ha: \(\beta j \neq{0}\)

The p-value of \(b_1\) and \(b_2\) is 2e-16 and 0.0000986, so we have evidence to reject the null hypothesis in favor of the alternative hypothesis, meaning that \(b_1\) and \(b_2\) are not equal to 0. Thus, predictor carbon and hydrogen have an impact on HHV, when holding oxygen constant.

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

We consider removing the predictor that does not meet the significance level based on the results of 5a. That is, we should eliminate the variable oxygen (because of no impact) then re-fit the linear model.

lm_spec %>%
  fit(HHV ~ carbon + hydrogen, data = biomass) -> lm_new_fit

lm_new_fit$fit %>%
  summary()

## 
## Call:
## stats::lm(formula = HHV ~ carbon + hydrogen, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1393  -0.4984  -0.1074   0.3934  10.6733 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.02853    0.49815   2.065   0.0394 *  
## carbon       0.34805    0.00646  53.882  < 2e-16 ***
## hydrogen     0.24180    0.05491   4.403 1.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.448 on 533 degrees of freedom
## Multiple R-squared:  0.8518, Adjusted R-squared:  0.8512 
## F-statistic:  1532 on 2 and 533 DF,  p-value: < 2.2e-16

How well do the models in (a) and (e) fit the data? How big was the effect of removing the predictor?

The intercept and oxygen are not significant in the 5a model, but all of the predictors and the intercept are significant in the 5e model. Plus, the adjusted R-squared is 0.8509 in 5a, which is less than the adjusted R-squared in 5e (0.8512). As a result, removing the predictor oxygen is a wise decision.

STAT-627 Assignment 1

Yunting Chiu

2021-05-23