In this assignment, I will be using Tidymodels framework instead of base R.

1 Exercise 1 (10 points)

Explain the assumptions we are making when performing Principle Component Analysis (PCA). What happens when these assumptions are violated?

2 Exercise 2 (10 points)

Answer the following questions regarding Principle Component Analysis.

Ans: Yes. One assumption of PCA is that we assume that each of the variables in the data set has been centered to have mean zero so it is important to standardize the data set before applying PCA.

Ans: No. Because PCA is an unsupervised learning technique, and it is for exploratory data analysis. There is no need to remove highly correlated variables. We only need to make sure that the new components are uncorrelated with the preceding components (e.g. \(Z_2\) uncorrelated with \(Z_1\)).

Ans: The variances of the components are the eigenvalues. If eigenvalues are roughly equal, it means that the variances of the variables are roughly equal. In that case, the reduction of the data set via PCA into a smaller dimensional subspace might NOT be ideal since it is hard to distinguish “more informative” components (e.g. the variable that accounts for the most variation in the data set).

Ans: PCA works better for linear data because one of its assumptions is that it assumes the data set to be linear combinations of the variables. However, some techniques can be applied for nonlinear PCA.

3 Exercise 3 (10 points)

You will in this exercise explore a data set using PCA. The data comes from the #tidytuesday project and is about Student Loan Payments.

Load in the data using the following script.

loans <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-11-26/loans.csv") %>%
  select(-agency_name, -added) %>%
  drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   agency_name = col_character(),
##   year = col_double(),
##   quarter = col_double(),
##   starting = col_double(),
##   added = col_double(),
##   total = col_double(),
##   consolidation = col_double(),
##   rehabilitation = col_double(),
##   voluntary_payments = col_double(),
##   wage_garnishments = col_double()
## )
  1. Use the prcomp() function to perform PCA on the loans data set. Set scale. = TRUE to perform scaling. What results are contained in this object? (hint: use the names() function)

loans data set has the following variables

names(loans)
## [1] "year"               "quarter"            "starting"          
## [4] "total"              "consolidation"      "rehabilitation"    
## [7] "voluntary_payments" "wage_garnishments"

Make sure we need to remove all non-numeric columns in advance. Sure, we can go ahead.

str(loans)
## tibble [278 × 8] (S3: tbl_df/tbl/data.frame)
##  $ year              : num [1:278] 15 15 15 15 15 15 15 15 15 15 ...
##  $ quarter           : num [1:278] 4 4 4 4 4 4 4 4 4 4 ...
##  $ starting          : num [1:278] 5.81e+09 3.69e+09 2.36e+09 7.04e+08 2.95e+09 ...
##  $ total             : num [1:278] 1.23e+08 1.13e+08 8.39e+07 9.96e+07 7.57e+07 ...
##  $ consolidation     : num [1:278] 20081894 11533809 7377703 3401361 8946976 ...
##  $ rehabilitation    : num [1:278] 90952573 86967994 64227391 85960328 58395653 ...
##  $ voluntary_payments: num [1:278] 5485507 4885225 3939866 2509000 2909013 ...
##  $ wage_garnishments : num [1:278] 6082668 9939819 8308043 7773214 5473844 ...

3.1 Principal Component Analysis

Performing PCA on the loans data set

loans_pca <- prcomp(~., data = loans, scale. = TRUE) 
loans_pca
## Standard deviations (1, .., p=8):
## [1] 2.280109e+00 1.092617e+00 8.806929e-01 7.109859e-01 4.911129e-01
## [6] 2.298129e-01 1.793515e-01 3.283268e-16
## 
## Rotation (n x k) = (8 x 8):
##                           PC1          PC2         PC3         PC4         PC5
## year               0.15481630 -0.609154282  0.71098192  0.25713986 -0.16708786
## quarter            0.03100797  0.786863976  0.52892399  0.24341051 -0.19005106
## starting           0.43098287  0.020871379  0.04127974 -0.09648700 -0.00345036
## total              0.42241658  0.007288533 -0.21562217  0.06348967 -0.36946649
## consolidation      0.38135985  0.049807680  0.18110377 -0.60881252  0.21861788
## rehabilitation     0.40387031 -0.005538468 -0.28934855  0.13556608 -0.56400604
## voluntary_payments 0.41562996  0.079102751  0.13888702 -0.21645279  0.29164295
## wage_garnishments  0.35999109  0.022774599 -0.17531809  0.65223239  0.59033267
##                             PC6         PC7           PC8
## year                0.008145368  0.07322624  5.460575e-17
## quarter             0.046285326  0.05111353  2.013611e-16
## starting            0.307381188 -0.84161113 -4.778782e-17
## total              -0.010590898  0.19628903 -7.719069e-01
## consolidation       0.445114306  0.43687990  1.088487e-01
## rehabilitation     -0.068857583  0.15411095  6.230273e-01
## voluntary_payments -0.816160793 -0.05285133  2.976180e-02
## wage_garnishments   0.184957853  0.16667081  5.712260e-02
  1. Calculate the amount of variance explained by each principal component. (hint: look at ?broom::tidy.prcomp)

Standard deviation = square root of the variance, so the variance by each principal component will be:

loans_pca %>%
  tidy(matrix = "pcs") %>%
  mutate(Variance = (std.dev)^2) %>%
  select(PC, Variance)
## # A tibble: 8 x 2
##      PC Variance
##   <dbl>    <dbl>
## 1     1 5.20e+ 0
## 2     2 1.19e+ 0
## 3     3 7.76e- 1
## 4     4 5.06e- 1
## 5     5 2.41e- 1
## 6     6 5.28e- 2
## 7     7 3.22e- 2
## 8     8 1.08e-31
  1. Use the tidy() function to extract the loadings. Which variable contributed most to the first principle component? Second Component?
  • starting contributes the most in PC1.
  • quarter contributes the most in PC2.
loans_pca %>%
  tidy(matrix = "loadings") %>%
  filter(PC == 1 | PC == 2) %>%
  ggplot(aes(value, column)) +
  geom_col() +
  facet_wrap(~PC) +
  theme_bw()

  1. Use the augment() function to get back the transformation and create a scatter plot of any two components of your choice.

The fitted PC1 against PC2 coordinates for each quarter of the year are shown below.

augment(loans_pca, newdata = loans) %>%
  ggplot(aes(x = .fittedPC1, y = .fittedPC2, color = factor(quarter))) +
  geom_point() +
  theme_bw() +
  labs(color = "Quarter") +
  ggtitle("PC1 against PC2")

4 Exercise 4 (15 points)

In this exercise, you are tasked to predict the weight of an animal in a zoo, based on which words are used to describe it. The animals data set can be downloaded here.

Read the data set. We can see there are 801 variables with 479 observations!!!

animals <- read_csv("./data/animals.csv") %>%
  filter(!is.na(weight))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
dim(animals)
## [1] 479 801
# str(animals) # all variables are numerical

This data set contains 801 variables. The first variable weight is the natural log of the mean weight of the animal. The remaining variables are named tf_* which shows how many times the word * appears in the description of the animal.

Use {tidymodels} to set up a workflow to train a PC regression. We can do this by specifying a linear regression model, and create a preprocessor recipe with {recipes} that applies PCA transformation on the predictors using step_pca(). Use the threshold argument in step_pca() to only keep the principal components that explain 90% of the variance.

Setting a seed and splitting the data to the training and testing sets

set.seed(1234)
animals_split <- initial_split(animals, strata = "weight")
animals_train <- training(animals_split)
animals_test <- testing(animals_split)

Customize a recipe for the question

rec_spec <- recipe(weight ~., data = animals_train) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), threshold = 0.9)
rec_spec 
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor        800
## 
## Operations:
## 
## Centering and scaling for all_predictors()
## No PCA components were extracted.

Construct a linear regression specification

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")
lm_spec
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Construct the model workflow

pcr_wf <- workflow() %>%
  add_recipe(rec_spec) %>%
  add_model(lm_spec)
pcr_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_normalize()
## • step_pca()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

4.1 Model Fitting

Fit the model

pcr_fit <- fit(pcr_wf, data = animals_train)

How well does this model perform on the testing data set?

Lower value of RMSE indicates a better fit. In the initial stage, we want to keep the principal components that explain 90% of the variance so we set threshold = 0.9 in step_pca(). The algorithm will generate enough components to capture 90 % of the variability in the variables. Because we have 90% variables, the rmse will be small when compared to threshold = 0.1.

augment(pcr_fit, new_data = animals_test) %>%
  rmse(truth = weight, estimate = .pred) # root mean squared error
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        2.67

4.2 Visualization

Because Principal Component Analysis is a linear method, the blue line can help us read the plot more easily. As of now, the animals data set contains 800 predictors, and PCR is effective at reducing the dimensionality of large data sets. However, the plot indicates the PCR model doesn’t have a good fit.

augment(pcr_fit, new_data = animals_test) %>%
  ggplot(aes(weight, .pred)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "blue") +
  coord_fixed() +
  theme_bw()

5 Exercise 5 (10 points)

For part (a) through (c) indicate which of the statements are correct. Justify your answers.

  1. The lasso, relative to least squares, is:
    1. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
    2. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
    3. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
    4. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

5.1 The lasso (L1 Regularization), relative to least squares is

3 is correct. Compared to least squares models, the lasso is a restrictive model. Lasso regression can shrunk coefficients to 0 exactly (unless \(\lambda = \infty\)). In comparison to least squares, it penalizes the coefficients of non-essential variables to produce a model with a lower variance and higher bias.

5.2 The ridge (L2 Regularization), relative to least squares is

  1. Repeat (a) for ridge regression relative to least squares.

3 is correct. Because coefficients tend towards zero, variance decreases and bias increases, causing ridge regression less flexible than least squares regression. As the coefficient estimates shrink, we can obtain a relatively better MSE in ridge regression. The least squares coefficient results in a larger variance value. Ridge regression, on the other hand, can still perform well by trading off a increase in bias and reduce in variance.

5.3 The non-linear methods, relative to least squares is

  1. Repeat (a) for non-linear methods relative to least squares.

2 is correct. Non-linear models have more flexibility and less bias than least squares models. Non-linear models perform better when the linearity assumption is violated. Due to Non-linear models more sensitive fits to the underlying data, these approaches will have higher variation and will require a substantial reduction in bias to work well.

6 Exercise 6 (10 points)

Suppose we estimate the regression coefficients in a linear regression model by minimizing \[ \sum_{i=1}^n \left( y_i - \beta_0 - \sum^p_{j=1}\beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \]

for a particular value of \(\lambda\). For part (a) through (c) indicate which of the statements are correct. Justify your answers.

  1. As we increase \(\lambda\) from 0, the training RSS will:
    • Increase initially, and then eventually start decreasing in an inverted U shape.
    • Decrease initially, and then eventually start increasing in a U shape.
    • Steadily increase.
    • Steadily decrease.
    • Remain constant.

As we increase \(\lambda\) from 0, the impact of the shrinkage penalty grows, meaning that we want smaller and smaller ridge regression coefficients in order to achieve the objection of minimization. This puts more and more restrictions on the coefficients. Therefore the training RSS will steadily increase.

  1. Repeat (a) for test RSS.

As we increase \(\lambda\) from 0, the test RSS will decrease initially and increase following a U shape.

  1. Repeat (a) for variance.

As we increase \(\lambda\) from 0, for the same reason, the model is becoming less and less flexible, meaning that the coefficients are getting smaller and smaller. Therefore, the variance will steadily decrease.

  1. Repeat (a) for squared bias.

As we increase \(\lambda\) from 0, the impact of the shrinkage penalty grows, meaning that we are putting more and more restrictions on the coefficients. Therefore, there will be a steady increase in bias.

  1. Repeat (a) for the irreducible error.

The irreducible error remains constant because by definition, the irreducible error is independent of the model, so it is independent of \(\lambda\) as well.

7 Exercise 7 (15 points)

In this exercise, you are tasked to predict the weight of an animal in a zoo, based on which words are used to describe it. The animals data set can be downloaded here.

This data set contains 801 variables. The first variable weight is the natural log of the mean weight of the animal. The remaining variables are named tf_* which shows how many times the word * appears in the description of the animal.

  1. Fit a lasso regression model to predict weight based on all the other variables.

  2. Use the tune package to perform hyperparameter tuning to select the best value of \(\lambda\). Use 10 bootstraps as the resamples data set.

  3. How well does this model perform on the testing data set?

7.1 Lasso Regression

Recap the data set

animals %>%
  head()
## # A tibble: 6 x 801
##   weight  tf_1 tf_10 tf_100 tf_11 tf_12 tf_15 tf_18  tf_2 tf_20 tf_25  tf_3
##    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   4.25     0     1      0     0     0     1     0     1     1     0     0
## 2   1.50     0     1      0     0     0     0     0     1     0     0     1
## 3   8.41     0     2      1     1     1     0     0     2     1     1     1
## 4   1.08     0     0      0     0     0     1     0     0     1     0     0
## 5  -2.10     0     0      0     0     1     1     0     0     0     0     0
## 6   7.58     0     1      1     1     1     0     0     2     1     0     1
## # … with 789 more variables: tf_30 <dbl>, tf_4 <dbl>, tf_40 <dbl>, tf_5 <dbl>,
## #   tf_50 <dbl>, tf_6 <dbl>, tf_60 <dbl>, tf_7 <dbl>, tf_70 <dbl>, tf_8 <dbl>,
## #   tf_80 <dbl>, tf_ability <dbl>, tf_able <dbl>, tf_according <dbl>,
## #   tf_across <dbl>, tf_active <dbl>, tf_activity <dbl>, tf_actually <dbl>,
## #   tf_adapted <dbl>, tf_addition <dbl>, tf_adult <dbl>, tf_adults <dbl>,
## #   tf_affected <dbl>, tf_africa <dbl>, tf_african <dbl>, tf_age <dbl>,
## #   tf_aggressive <dbl>, tf_ago <dbl>, tf_agriculture <dbl>, tf_air <dbl>,
## #   tf_allow <dbl>, tf_allows <dbl>, tf_almost <dbl>, tf_alone <dbl>,
## #   tf_along <dbl>, tf_also <dbl>, tf_although <dbl>, tf_always <dbl>,
## #   tf_america <dbl>, tf_american <dbl>, tf_among <dbl>, tf_amongst <dbl>,
## #   tf_amount <dbl>, tf_anatomy <dbl>, tf_ancient <dbl>, tf_animal <dbl>,
## #   tf_animals <dbl>, tf_another <dbl>, tf_anything <dbl>, tf_anywhere <dbl>,
## #   tf_appear <dbl>, tf_appearance <dbl>, tf_approximately <dbl>,
## #   tf_aquatic <dbl>, tf_area <dbl>, tf_areas <dbl>, tf_around <dbl>,
## #   tf_asia <dbl>, tf_asian <dbl>, tf_attack <dbl>, tf_attract <dbl>,
## #   tf_australia <dbl>, tf_available <dbl>, tf_average <dbl>, tf_avoid <dbl>,
## #   tf_away <dbl>, tf_babies <dbl>, tf_baby <dbl>, tf_back <dbl>,
## #   tf_based <dbl>, tf_beak <dbl>, tf_become <dbl>, tf_becoming <dbl>,
## #   tf_begin <dbl>, tf_begins <dbl>, tf_behavior <dbl>, tf_behaviour <dbl>,
## #   tf_behind <dbl>, tf_believe <dbl>, tf_believed <dbl>, tf_belong <dbl>,
## #   tf_berries <dbl>, tf_best <dbl>, tf_better <dbl>, tf_big <dbl>,
## #   tf_bigger <dbl>, tf_biggest <dbl>, tf_bird <dbl>, tf_birds <dbl>,
## #   tf_birth <dbl>, tf_bite <dbl>, tf_black <dbl>, tf_blind <dbl>,
## #   tf_bodies <dbl>, tf_body <dbl>, tf_born <dbl>, tf_branches <dbl>,
## #   tf_bred <dbl>, tf_breed <dbl>, tf_breeding <dbl>, …

mixture = 1 indicates that it is a lasso model.

# engine specification
lasso_spec <- linear_reg(mixture = 1, penalty = tune()) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

Put in our ingredients and get a recipe. Because lasso regression is scale sensitive, we must ensure that the variables are on the same scale by using step_normalize(all_predictors()).

# customize a recipe
lasso_rec <- recipe(weight ~., data = animals_train) %>%
  step_normalize(all_predictors()) %>%
  step_zv(all_predictors()) 

7.2 Workflow

lasso_wf <- workflow() %>%
  add_model(lasso_spec) %>%
  add_recipe(lasso_rec)

Create a bootstrap term in order to use in the following tune_grid() session. We use 10 bootstraps.

set.seed(4321)
animals_boots <- bootstraps(animals_train, strata = weight, time = 10)
animals_boots$splits
## [[1]]
## <Analysis/Assess/Total>
## <361/132/361>
## 
## [[2]]
## <Analysis/Assess/Total>
## <361/131/361>
## 
## [[3]]
## <Analysis/Assess/Total>
## <361/119/361>
## 
## [[4]]
## <Analysis/Assess/Total>
## <361/136/361>
## 
## [[5]]
## <Analysis/Assess/Total>
## <361/134/361>
## 
## [[6]]
## <Analysis/Assess/Total>
## <361/134/361>
## 
## [[7]]
## <Analysis/Assess/Total>
## <361/122/361>
## 
## [[8]]
## <Analysis/Assess/Total>
## <361/139/361>
## 
## [[9]]
## <Analysis/Assess/Total>
## <361/135/361>
## 
## [[10]]
## <Analysis/Assess/Total>
## <361/126/361>

Regularly predict the penalty 50 times using regular grids, with the penalty range limited to \(10^{-5}\) to \(0\). Note, these are in transformed units, the default transformation is \(log10\).

penalty_grid <- grid_regular(penalty(range = c(-5, 0)), levels = 50)

7.3 Tune

tune_res <- tune_grid(object = lasso_wf, resamples = animals_boots,
                      grid = penalty_grid)
tune_res
## # Tuning results
## # Bootstrap sampling using stratification 
## # A tibble: 10 x 4
##    splits            id          .metrics           .notes          
##    <list>            <chr>       <list>             <list>          
##  1 <split [361/132]> Bootstrap01 <tibble [100 × 5]> <tibble [0 × 1]>
##  2 <split [361/131]> Bootstrap02 <tibble [100 × 5]> <tibble [0 × 1]>
##  3 <split [361/119]> Bootstrap03 <tibble [100 × 5]> <tibble [0 × 1]>
##  4 <split [361/136]> Bootstrap04 <tibble [100 × 5]> <tibble [0 × 1]>
##  5 <split [361/134]> Bootstrap05 <tibble [100 × 5]> <tibble [0 × 1]>
##  6 <split [361/134]> Bootstrap06 <tibble [100 × 5]> <tibble [0 × 1]>
##  7 <split [361/122]> Bootstrap07 <tibble [100 × 5]> <tibble [0 × 1]>
##  8 <split [361/139]> Bootstrap08 <tibble [100 × 5]> <tibble [0 × 1]>
##  9 <split [361/135]> Bootstrap09 <tibble [100 × 5]> <tibble [0 × 1]>
## 10 <split [361/126]> Bootstrap10 <tibble [100 × 5]> <tibble [0 × 1]>

Display the each penalty on rmse and rsq, respectively.

tune_res %>%
  collect_metrics()
## # A tibble: 100 x 7
##      penalty .metric .estimator  mean     n std_err .config              
##        <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1 0.00001   rmse    standard   4.02     10  0.317  Preprocessor1_Model01
##  2 0.00001   rsq     standard   0.291    10  0.0188 Preprocessor1_Model01
##  3 0.0000126 rmse    standard   4.02     10  0.317  Preprocessor1_Model02
##  4 0.0000126 rsq     standard   0.291    10  0.0188 Preprocessor1_Model02
##  5 0.0000160 rmse    standard   4.02     10  0.317  Preprocessor1_Model03
##  6 0.0000160 rsq     standard   0.291    10  0.0188 Preprocessor1_Model03
##  7 0.0000202 rmse    standard   4.02     10  0.317  Preprocessor1_Model04
##  8 0.0000202 rsq     standard   0.291    10  0.0188 Preprocessor1_Model04
##  9 0.0000256 rmse    standard   4.02     10  0.317  Preprocessor1_Model05
## 10 0.0000256 rsq     standard   0.291    10  0.0188 Preprocessor1_Model05
## # … with 90 more rows

Display the best rmse and rsq values of penalty

tune_res %>%
  show_best(metric = "rsq") %>%
  head(1)
## # A tibble: 1 x 7
##   penalty .metric .estimator  mean     n std_err .config              
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1   0.244 rsq     standard   0.387    10  0.0216 Preprocessor1_Model44
tune_res %>%
  show_best(metric = "rmse") %>%
  head(1)
## # A tibble: 1 x 7
##   penalty .metric .estimator  mean     n std_err .config              
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1   0.244 rmse    standard    2.86    10   0.165 Preprocessor1_Model44

We can see that if the amount of regularization is close to 0.25, the rmse is low and the rsq is high, on average. Thus, the best hyperparameter of this model should be here.

tune_res %>%
  autoplot() +
  geom_vline(xintercept = 0.2442053, color = "red") 

7.4 Best Hyperparameter

The best value of \(\lambda\) is 0.244.

best_rmse <- select_best(tune_res, metric = "rmse")
best_rmse
## # A tibble: 1 x 2
##   penalty .config              
##     <dbl> <chr>                
## 1   0.244 Preprocessor1_Model44
lasso_final <- finalize_workflow(lasso_wf, best_rmse)
lasso_final_fit <- fit(lasso_final, data = animals_train) 

7.5 Peformance

Fit a lasso regression model to predict weight based on all the other variables. With the rmse value, we can conclude that the model performs mediocrity. However, we can carry out additional comparisons, such as canceling the penalty term in the model, to determine whether the good performance is due to hyperparameters or the model itself.

augment(lasso_final_fit, new_data = animals_test) %>%
  rmse(truth = weight, estimate = .pred)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        2.92

7.6 Visualization

augment(lasso_final_fit, new_data = animals_test) %>%
  ggplot(aes(weight, .pred)) +
  geom_abline(slope = 1, intercept = 0) +
  geom_point() +
  theme_bw() +
  ggtitle("Lasso Prediction of Weight")

The weight range is from -8.111728 to 10.491274 from the original data set, and based on the plot above, we can conclude that the model does not work well; tuning \(\lambda\) appears to have no obvious benefit in the model.

animals %>%
  select(weight) %>%
  range()
## [1] -8.111728 10.491274

8 References