In this assignment, I will be using Tidymodels framework instead of base R.
Explain the assumptions we are making when performing Principle Component Analysis (PCA). What happens when these assumptions are violated?
Answer the following questions regarding Principle Component Analysis.
Ans: Yes. One assumption of PCA is that we assume that each of the variables in the data set has been centered to have mean zero so it is important to standardize the data set before applying PCA.
Ans: No. Because PCA is an unsupervised learning technique, and it is for exploratory data analysis. There is no need to remove highly correlated variables. We only need to make sure that the new components are uncorrelated with the preceding components (e.g. \(Z_2\) uncorrelated with \(Z_1\)).
Ans: The variances of the components are the eigenvalues. If eigenvalues are roughly equal, it means that the variances of the variables are roughly equal. In that case, the reduction of the data set via PCA into a smaller dimensional subspace might NOT be ideal since it is hard to distinguish “more informative” components (e.g. the variable that accounts for the most variation in the data set).
Ans: PCA works better for linear data because one of its assumptions is that it assumes the data set to be linear combinations of the variables. However, some techniques can be applied for nonlinear PCA.
You will in this exercise explore a data set using PCA. The data comes from the #tidytuesday project and is about Student Loan Payments.
Load in the data using the following script.
loans <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-11-26/loans.csv") %>%
select(-agency_name, -added) %>%
drop_na()##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## agency_name = col_character(),
## year = col_double(),
## quarter = col_double(),
## starting = col_double(),
## added = col_double(),
## total = col_double(),
## consolidation = col_double(),
## rehabilitation = col_double(),
## voluntary_payments = col_double(),
## wage_garnishments = col_double()
## )
prcomp() function to perform PCA on the loans data set. Set scale. = TRUE to perform scaling. What results are contained in this object? (hint: use the names() function)loans data set has the following variables
names(loans)## [1] "year" "quarter" "starting"
## [4] "total" "consolidation" "rehabilitation"
## [7] "voluntary_payments" "wage_garnishments"
Make sure we need to remove all non-numeric columns in advance. Sure, we can go ahead.
str(loans)## tibble [278 × 8] (S3: tbl_df/tbl/data.frame)
## $ year : num [1:278] 15 15 15 15 15 15 15 15 15 15 ...
## $ quarter : num [1:278] 4 4 4 4 4 4 4 4 4 4 ...
## $ starting : num [1:278] 5.81e+09 3.69e+09 2.36e+09 7.04e+08 2.95e+09 ...
## $ total : num [1:278] 1.23e+08 1.13e+08 8.39e+07 9.96e+07 7.57e+07 ...
## $ consolidation : num [1:278] 20081894 11533809 7377703 3401361 8946976 ...
## $ rehabilitation : num [1:278] 90952573 86967994 64227391 85960328 58395653 ...
## $ voluntary_payments: num [1:278] 5485507 4885225 3939866 2509000 2909013 ...
## $ wage_garnishments : num [1:278] 6082668 9939819 8308043 7773214 5473844 ...
Performing PCA on the loans data set
loans_pca <- prcomp(~., data = loans, scale. = TRUE)
loans_pca## Standard deviations (1, .., p=8):
## [1] 2.280109e+00 1.092617e+00 8.806929e-01 7.109859e-01 4.911129e-01
## [6] 2.298129e-01 1.793515e-01 3.283268e-16
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## year 0.15481630 -0.609154282 0.71098192 0.25713986 -0.16708786
## quarter 0.03100797 0.786863976 0.52892399 0.24341051 -0.19005106
## starting 0.43098287 0.020871379 0.04127974 -0.09648700 -0.00345036
## total 0.42241658 0.007288533 -0.21562217 0.06348967 -0.36946649
## consolidation 0.38135985 0.049807680 0.18110377 -0.60881252 0.21861788
## rehabilitation 0.40387031 -0.005538468 -0.28934855 0.13556608 -0.56400604
## voluntary_payments 0.41562996 0.079102751 0.13888702 -0.21645279 0.29164295
## wage_garnishments 0.35999109 0.022774599 -0.17531809 0.65223239 0.59033267
## PC6 PC7 PC8
## year 0.008145368 0.07322624 5.460575e-17
## quarter 0.046285326 0.05111353 2.013611e-16
## starting 0.307381188 -0.84161113 -4.778782e-17
## total -0.010590898 0.19628903 -7.719069e-01
## consolidation 0.445114306 0.43687990 1.088487e-01
## rehabilitation -0.068857583 0.15411095 6.230273e-01
## voluntary_payments -0.816160793 -0.05285133 2.976180e-02
## wage_garnishments 0.184957853 0.16667081 5.712260e-02
?broom::tidy.prcomp)Standard deviation = square root of the variance, so the variance by each principal component will be:
loans_pca %>%
tidy(matrix = "pcs") %>%
mutate(Variance = (std.dev)^2) %>%
select(PC, Variance)## # A tibble: 8 x 2
## PC Variance
## <dbl> <dbl>
## 1 1 5.20e+ 0
## 2 2 1.19e+ 0
## 3 3 7.76e- 1
## 4 4 5.06e- 1
## 5 5 2.41e- 1
## 6 6 5.28e- 2
## 7 7 3.22e- 2
## 8 8 1.08e-31
tidy() function to extract the loadings. Which variable contributed most to the first principle component? Second Component?starting contributes the most in PC1.quarter contributes the most in PC2.loans_pca %>%
tidy(matrix = "loadings") %>%
filter(PC == 1 | PC == 2) %>%
ggplot(aes(value, column)) +
geom_col() +
facet_wrap(~PC) +
theme_bw()augment() function to get back the transformation and create a scatter plot of any two components of your choice.The fitted PC1 against PC2 coordinates for each quarter of the year are shown below.
augment(loans_pca, newdata = loans) %>%
ggplot(aes(x = .fittedPC1, y = .fittedPC2, color = factor(quarter))) +
geom_point() +
theme_bw() +
labs(color = "Quarter") +
ggtitle("PC1 against PC2")In this exercise, you are tasked to predict the weight of an animal in a zoo, based on which words are used to describe it. The animals data set can be downloaded here.
Read the data set. We can see there are 801 variables with 479 observations!!!
animals <- read_csv("./data/animals.csv") %>%
filter(!is.na(weight))##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
dim(animals)## [1] 479 801
# str(animals) # all variables are numericalThis data set contains 801 variables. The first variable weight is the natural log of the mean weight of the animal. The remaining variables are named tf_* which shows how many times the word * appears in the description of the animal.
Use {tidymodels} to set up a workflow to train a PC regression. We can do this by specifying a linear regression model, and create a preprocessor recipe with {recipes} that applies PCA transformation on the predictors using step_pca(). Use the threshold argument in step_pca() to only keep the principal components that explain 90% of the variance.
Setting a seed and splitting the data to the training and testing sets
set.seed(1234)
animals_split <- initial_split(animals, strata = "weight")
animals_train <- training(animals_split)
animals_test <- testing(animals_split)Customize a recipe for the question
rec_spec <- recipe(weight ~., data = animals_train) %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors(), threshold = 0.9)
rec_spec ## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 800
##
## Operations:
##
## Centering and scaling for all_predictors()
## No PCA components were extracted.
Construct a linear regression specification
lm_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
lm_spec## Linear Regression Model Specification (regression)
##
## Computational engine: lm
Construct the model workflow
pcr_wf <- workflow() %>%
add_recipe(rec_spec) %>%
add_model(lm_spec)
pcr_wf## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_normalize()
## • step_pca()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
Fit the model
pcr_fit <- fit(pcr_wf, data = animals_train)How well does this model perform on the testing data set?
Lower value of RMSE indicates a better fit. In the initial stage, we want to keep the principal components that explain 90% of the variance so we set threshold = 0.9 in step_pca(). The algorithm will generate enough components to capture 90 % of the variability in the variables. Because we have 90% variables, the rmse will be small when compared to threshold = 0.1.
augment(pcr_fit, new_data = animals_test) %>%
rmse(truth = weight, estimate = .pred) # root mean squared error## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.67
Because Principal Component Analysis is a linear method, the blue line can help us read the plot more easily. As of now, the animals data set contains 800 predictors, and PCR is effective at reducing the dimensionality of large data sets. However, the plot indicates the PCR model doesn’t have a good fit.
augment(pcr_fit, new_data = animals_test) %>%
ggplot(aes(weight, .pred)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "blue") +
coord_fixed() +
theme_bw()For part (a) through (c) indicate which of the statements are correct. Justify your answers.
3 is correct. Compared to least squares models, the lasso is a restrictive model. Lasso regression can shrunk coefficients to 0 exactly (unless \(\lambda = \infty\)). In comparison to least squares, it penalizes the coefficients of non-essential variables to produce a model with a lower variance and higher bias.
3 is correct. Because coefficients tend towards zero, variance decreases and bias increases, causing ridge regression less flexible than least squares regression. As the coefficient estimates shrink, we can obtain a relatively better MSE in ridge regression. The least squares coefficient results in a larger variance value. Ridge regression, on the other hand, can still perform well by trading off a increase in bias and reduce in variance.
2 is correct. Non-linear models have more flexibility and less bias than least squares models. Non-linear models perform better when the linearity assumption is violated. Due to Non-linear models more sensitive fits to the underlying data, these approaches will have higher variation and will require a substantial reduction in bias to work well.
Suppose we estimate the regression coefficients in a linear regression model by minimizing \[ \sum_{i=1}^n \left( y_i - \beta_0 - \sum^p_{j=1}\beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \]
for a particular value of \(\lambda\). For part (a) through (c) indicate which of the statements are correct. Justify your answers.
As we increase \(\lambda\) from 0, the impact of the shrinkage penalty grows, meaning that we want smaller and smaller ridge regression coefficients in order to achieve the objection of minimization. This puts more and more restrictions on the coefficients. Therefore the training RSS will steadily increase.
As we increase \(\lambda\) from 0, the test RSS will decrease initially and increase following a U shape.
As we increase \(\lambda\) from 0, for the same reason, the model is becoming less and less flexible, meaning that the coefficients are getting smaller and smaller. Therefore, the variance will steadily decrease.
As we increase \(\lambda\) from 0, the impact of the shrinkage penalty grows, meaning that we are putting more and more restrictions on the coefficients. Therefore, there will be a steady increase in bias.
The irreducible error remains constant because by definition, the irreducible error is independent of the model, so it is independent of \(\lambda\) as well.
In this exercise, you are tasked to predict the weight of an animal in a zoo, based on which words are used to describe it. The animals data set can be downloaded here.
This data set contains 801 variables. The first variable weight is the natural log of the mean weight of the animal. The remaining variables are named tf_* which shows how many times the word * appears in the description of the animal.
Fit a lasso regression model to predict weight based on all the other variables.
Use the tune package to perform hyperparameter tuning to select the best value of \(\lambda\). Use 10 bootstraps as the resamples data set.
How well does this model perform on the testing data set?
Recap the data set
animals %>%
head()## # A tibble: 6 x 801
## weight tf_1 tf_10 tf_100 tf_11 tf_12 tf_15 tf_18 tf_2 tf_20 tf_25 tf_3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4.25 0 1 0 0 0 1 0 1 1 0 0
## 2 1.50 0 1 0 0 0 0 0 1 0 0 1
## 3 8.41 0 2 1 1 1 0 0 2 1 1 1
## 4 1.08 0 0 0 0 0 1 0 0 1 0 0
## 5 -2.10 0 0 0 0 1 1 0 0 0 0 0
## 6 7.58 0 1 1 1 1 0 0 2 1 0 1
## # … with 789 more variables: tf_30 <dbl>, tf_4 <dbl>, tf_40 <dbl>, tf_5 <dbl>,
## # tf_50 <dbl>, tf_6 <dbl>, tf_60 <dbl>, tf_7 <dbl>, tf_70 <dbl>, tf_8 <dbl>,
## # tf_80 <dbl>, tf_ability <dbl>, tf_able <dbl>, tf_according <dbl>,
## # tf_across <dbl>, tf_active <dbl>, tf_activity <dbl>, tf_actually <dbl>,
## # tf_adapted <dbl>, tf_addition <dbl>, tf_adult <dbl>, tf_adults <dbl>,
## # tf_affected <dbl>, tf_africa <dbl>, tf_african <dbl>, tf_age <dbl>,
## # tf_aggressive <dbl>, tf_ago <dbl>, tf_agriculture <dbl>, tf_air <dbl>,
## # tf_allow <dbl>, tf_allows <dbl>, tf_almost <dbl>, tf_alone <dbl>,
## # tf_along <dbl>, tf_also <dbl>, tf_although <dbl>, tf_always <dbl>,
## # tf_america <dbl>, tf_american <dbl>, tf_among <dbl>, tf_amongst <dbl>,
## # tf_amount <dbl>, tf_anatomy <dbl>, tf_ancient <dbl>, tf_animal <dbl>,
## # tf_animals <dbl>, tf_another <dbl>, tf_anything <dbl>, tf_anywhere <dbl>,
## # tf_appear <dbl>, tf_appearance <dbl>, tf_approximately <dbl>,
## # tf_aquatic <dbl>, tf_area <dbl>, tf_areas <dbl>, tf_around <dbl>,
## # tf_asia <dbl>, tf_asian <dbl>, tf_attack <dbl>, tf_attract <dbl>,
## # tf_australia <dbl>, tf_available <dbl>, tf_average <dbl>, tf_avoid <dbl>,
## # tf_away <dbl>, tf_babies <dbl>, tf_baby <dbl>, tf_back <dbl>,
## # tf_based <dbl>, tf_beak <dbl>, tf_become <dbl>, tf_becoming <dbl>,
## # tf_begin <dbl>, tf_begins <dbl>, tf_behavior <dbl>, tf_behaviour <dbl>,
## # tf_behind <dbl>, tf_believe <dbl>, tf_believed <dbl>, tf_belong <dbl>,
## # tf_berries <dbl>, tf_best <dbl>, tf_better <dbl>, tf_big <dbl>,
## # tf_bigger <dbl>, tf_biggest <dbl>, tf_bird <dbl>, tf_birds <dbl>,
## # tf_birth <dbl>, tf_bite <dbl>, tf_black <dbl>, tf_blind <dbl>,
## # tf_bodies <dbl>, tf_body <dbl>, tf_born <dbl>, tf_branches <dbl>,
## # tf_bred <dbl>, tf_breed <dbl>, tf_breeding <dbl>, …
mixture = 1 indicates that it is a lasso model.
# engine specification
lasso_spec <- linear_reg(mixture = 1, penalty = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")Put in our ingredients and get a recipe. Because lasso regression is scale sensitive, we must ensure that the variables are on the same scale by using step_normalize(all_predictors()).
# customize a recipe
lasso_rec <- recipe(weight ~., data = animals_train) %>%
step_normalize(all_predictors()) %>%
step_zv(all_predictors()) lasso_wf <- workflow() %>%
add_model(lasso_spec) %>%
add_recipe(lasso_rec)Create a bootstrap term in order to use in the following tune_grid() session. We use 10 bootstraps.
set.seed(4321)
animals_boots <- bootstraps(animals_train, strata = weight, time = 10)
animals_boots$splits## [[1]]
## <Analysis/Assess/Total>
## <361/132/361>
##
## [[2]]
## <Analysis/Assess/Total>
## <361/131/361>
##
## [[3]]
## <Analysis/Assess/Total>
## <361/119/361>
##
## [[4]]
## <Analysis/Assess/Total>
## <361/136/361>
##
## [[5]]
## <Analysis/Assess/Total>
## <361/134/361>
##
## [[6]]
## <Analysis/Assess/Total>
## <361/134/361>
##
## [[7]]
## <Analysis/Assess/Total>
## <361/122/361>
##
## [[8]]
## <Analysis/Assess/Total>
## <361/139/361>
##
## [[9]]
## <Analysis/Assess/Total>
## <361/135/361>
##
## [[10]]
## <Analysis/Assess/Total>
## <361/126/361>
Regularly predict the penalty 50 times using regular grids, with the penalty range limited to \(10^{-5}\) to \(0\). Note, these are in transformed units, the default transformation is \(log10\).
penalty_grid <- grid_regular(penalty(range = c(-5, 0)), levels = 50)tune_res <- tune_grid(object = lasso_wf, resamples = animals_boots,
grid = penalty_grid)
tune_res## # Tuning results
## # Bootstrap sampling using stratification
## # A tibble: 10 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [361/132]> Bootstrap01 <tibble [100 × 5]> <tibble [0 × 1]>
## 2 <split [361/131]> Bootstrap02 <tibble [100 × 5]> <tibble [0 × 1]>
## 3 <split [361/119]> Bootstrap03 <tibble [100 × 5]> <tibble [0 × 1]>
## 4 <split [361/136]> Bootstrap04 <tibble [100 × 5]> <tibble [0 × 1]>
## 5 <split [361/134]> Bootstrap05 <tibble [100 × 5]> <tibble [0 × 1]>
## 6 <split [361/134]> Bootstrap06 <tibble [100 × 5]> <tibble [0 × 1]>
## 7 <split [361/122]> Bootstrap07 <tibble [100 × 5]> <tibble [0 × 1]>
## 8 <split [361/139]> Bootstrap08 <tibble [100 × 5]> <tibble [0 × 1]>
## 9 <split [361/135]> Bootstrap09 <tibble [100 × 5]> <tibble [0 × 1]>
## 10 <split [361/126]> Bootstrap10 <tibble [100 × 5]> <tibble [0 × 1]>
Display the each penalty on rmse and rsq, respectively.
tune_res %>%
collect_metrics()## # A tibble: 100 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00001 rmse standard 4.02 10 0.317 Preprocessor1_Model01
## 2 0.00001 rsq standard 0.291 10 0.0188 Preprocessor1_Model01
## 3 0.0000126 rmse standard 4.02 10 0.317 Preprocessor1_Model02
## 4 0.0000126 rsq standard 0.291 10 0.0188 Preprocessor1_Model02
## 5 0.0000160 rmse standard 4.02 10 0.317 Preprocessor1_Model03
## 6 0.0000160 rsq standard 0.291 10 0.0188 Preprocessor1_Model03
## 7 0.0000202 rmse standard 4.02 10 0.317 Preprocessor1_Model04
## 8 0.0000202 rsq standard 0.291 10 0.0188 Preprocessor1_Model04
## 9 0.0000256 rmse standard 4.02 10 0.317 Preprocessor1_Model05
## 10 0.0000256 rsq standard 0.291 10 0.0188 Preprocessor1_Model05
## # … with 90 more rows
Display the best rmse and rsq values of penalty
tune_res %>%
show_best(metric = "rsq") %>%
head(1)## # A tibble: 1 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.244 rsq standard 0.387 10 0.0216 Preprocessor1_Model44
tune_res %>%
show_best(metric = "rmse") %>%
head(1)## # A tibble: 1 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.244 rmse standard 2.86 10 0.165 Preprocessor1_Model44
We can see that if the amount of regularization is close to 0.25, the rmse is low and the rsq is high, on average. Thus, the best hyperparameter of this model should be here.
tune_res %>%
autoplot() +
geom_vline(xintercept = 0.2442053, color = "red") The best value of \(\lambda\) is 0.244.
best_rmse <- select_best(tune_res, metric = "rmse")
best_rmse## # A tibble: 1 x 2
## penalty .config
## <dbl> <chr>
## 1 0.244 Preprocessor1_Model44
lasso_final <- finalize_workflow(lasso_wf, best_rmse)lasso_final_fit <- fit(lasso_final, data = animals_train) Fit a lasso regression model to predict weight based on all the other variables. With the rmse value, we can conclude that the model performs mediocrity. However, we can carry out additional comparisons, such as canceling the penalty term in the model, to determine whether the good performance is due to hyperparameters or the model itself.
augment(lasso_final_fit, new_data = animals_test) %>%
rmse(truth = weight, estimate = .pred)## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.92
augment(lasso_final_fit, new_data = animals_test) %>%
ggplot(aes(weight, .pred)) +
geom_abline(slope = 1, intercept = 0) +
geom_point() +
theme_bw() +
ggtitle("Lasso Prediction of Weight")The weight range is from -8.111728 to 10.491274 from the original data set, and based on the plot above, we can conclude that the model does not work well; tuning \(\lambda\) appears to have no obvious benefit in the model.
animals %>%
select(weight) %>%
range()## [1] -8.111728 10.491274