suppressPackageStartupMessages( require(oetteR) ) 
suppressPackageStartupMessages( require(tidyverse) )

1 Cleaning Data

Before I learned about recipes. I had my own set of functions that would do something similar with less functionality of course. oetteR::f_clean_data() takes a dataframe and performs some automated cleaning steps and sorts the variables into categrocial and numerical categories and returns a list which I always name data_ls.

data = ISLR::Auto

data_ls = f_clean_data( data = data
                         # reduce number of levels to 10, group to other
                         , max_number_of_levels_factors = 10
                         # numericals with less than 10 unique values will be converted to factors
                         , min_number_of_levels_nums = 10
                         # exclude missing data
                         , exclude_missing = T
                         # negative values will be set to zero
                         ,replace_neg_values_with_zero = T
                         # allow negative values in these columns
                         ,allow_neg_values = 'null'
                         # tag id columns
                         , id_cols = 'name'
                         )

## [1] "Number of excluded observations: 0"

print( str(data_ls) )

## List of 6
##  $ data                :Classes 'tbl_df', 'tbl' and 'data.frame':    392 obs. of  9 variables:
##   ..$ mpg         : num [1:392] 18 15 18 16 17 15 14 14 14 15 ...
##   ..$ cylinders   : Ord.factor w/ 5 levels "3"<"4"<"5"<"6"<..: 5 5 5 5 5 5 5 5 5 5 ...
##   ..$ displacement: num [1:392] 307 350 318 304 302 429 454 440 455 390 ...
##   ..$ horsepower  : num [1:392] 130 165 150 150 140 198 220 215 225 190 ...
##   ..$ weight      : num [1:392] 3504 3693 3436 3433 3449 ...
##   ..$ acceleration: num [1:392] 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##   ..$ year        : num [1:392] 70 70 70 70 70 70 70 70 70 70 ...
##   ..$ origin      : Ord.factor w/ 3 levels "1"<"2"<"3": 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  $ numericals          : chr [1:6] "mpg" "displacement" "horsepower" "weight" ...
##  $ categoricals        : chr [1:2] "cylinders" "origin"
##  $ categoricals_ordered: chr [1:2] "cylinders" "origin"
##  $ all_variables       : chr [1:8] "mpg" "cylinders" "displacement" "horsepower" ...
##  $ ids                 : chr "name"
## NULL

Notice that numericals are converted to ordered factors

2 BoxCox Transformation

We can use a boxcox transformation on numerical variables.

data_ls = f_boxcox(data_ls)

print( str(data_ls) )

## List of 8
##  $ data                :Classes 'tbl_df', 'tbl' and 'data.frame':    392 obs. of  9 variables:
##   ..$ mpg         : num [1:392] 18 15 18 16 17 15 14 14 14 15 ...
##   ..$ cylinders   : Ord.factor w/ 5 levels "3"<"4"<"5"<"6"<..: 5 5 5 5 5 5 5 5 5 5 ...
##   ..$ displacement: num [1:392] 307 350 318 304 302 429 454 440 455 390 ...
##   ..$ horsepower  : num [1:392] 130 165 150 150 140 198 220 215 225 190 ...
##   ..$ weight      : num [1:392] 3504 3693 3436 3433 3449 ...
##   ..$ acceleration: num [1:392] 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##   ..$ year        : num [1:392] 70 70 70 70 70 70 70 70 70 70 ...
##   ..$ origin      : Ord.factor w/ 3 levels "1"<"2"<"3": 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  $ numericals          : chr [1:6] "mpg" "displacement" "horsepower" "weight" ...
##  $ categoricals        : chr [1:2] "cylinders" "origin"
##  $ categoricals_ordered: chr [1:2] "cylinders" "origin"
##  $ all_variables       : chr [1:8] "mpg" "cylinders" "displacement" "horsepower" ...
##  $ ids                 : chr "name"
##  $ boxcox_data         :'data.frame':    392 obs. of  6 variables:
##   ..$ mpg_boxcox         : num [1:392] 3.91 3.59 3.91 3.71 3.81 ...
##   ..$ displacement_boxcox: num [1:392] 2.74 2.76 2.74 2.73 2.73 ...
##   ..$ horsepower_boxcox  : num [1:392] 1.82 1.84 1.84 1.84 1.83 ...
##   ..$ weight_boxcox      : num [1:392] 3.05 3.05 3.04 3.04 3.04 ...
##   ..$ acceleration_boxcox: num [1:392] 5.74 5.55 5.36 5.74 5.17 ...
##   ..$ year_boxcox        : num [1:392] 14.7 14.7 14.7 14.7 14.7 ...
##  $ boxcox_names        : chr [1:6] "mpg_boxcox" "displacement_boxcox" "horsepower_boxcox" "weight_boxcox" ...
## NULL

3 PCA

Base R has a decent function for PCA prcomp(). However the returned object is a bit messy and contains a lot of matrices intesad of dataframes. oetteR::f_pca() is a convenience wrapper which uses data_ls lists.

pca_ls = f_pca( data_ls
             , center = T
             , scale = T
             , use_boxcox_tansformed_vars = T
             , include_ordered_categoricals = T
             # drop components that explain less variance than the threshold
             , threshold_vae_for_pc_perc = 0 )

The returned pca object has some new features compared to prcomp()

3.1 Call the original Data with the Principle Components as additional Variables

as_tibble( pca_ls$data )

## # A tibble: 392 x 17
##      mpg cylinders displacement horsepower weight acceleration  year
##    <dbl>     <ord>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>
##  1    18         8          307        130   3504         12.0    70
##  2    15         8          350        165   3693         11.5    70
##  3    18         8          318        150   3436         11.0    70
##  4    16         8          304        150   3433         12.0    70
##  5    17         8          302        140   3449         10.5    70
##  6    15         8          429        198   4341         10.0    70
##  7    14         8          454        220   4354          9.0    70
##  8    14         8          440        215   4312          8.5    70
##  9    14         8          455        225   4425         10.0    70
## 10    15         8          390        190   3850          8.5    70
## # ... with 382 more rows, and 10 more variables: origin <ord>,
## #   name <fctr>, PC1 <dbl>, PC2 <dbl>, PC3 <dbl>, PC4 <dbl>, PC5 <dbl>,
## #   PC6 <dbl>, PC7 <dbl>, PC8 <dbl>

3.2 Get variance explained as a dataframe

pca_ls$pca$vae

## # A tibble: 1 x 9
##   row_names      PC1      PC2      PC3      PC4      PC5     PC6      PC7
##       <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1       vae 38.79163 16.55043 14.77422 11.44624 7.038674 4.68511 3.811565
## # ... with 1 more variables: PC8 <dbl>

3.3 How much is each variable contributing to each component in percent

Note the columns add up to 100

pca_ls$pca$contrib

## # A tibble: 8 x 9
##             row_names       PC1        PC2         PC3        PC4
##                 <chr>     <dbl>      <dbl>       <dbl>      <dbl>
## 1          mpg_boxcox 15.869580  0.9922695  5.47987229  3.6485975
## 2 displacement_boxcox 16.988297  3.7871142  0.02072897  0.8363316
## 3   horsepower_boxcox 16.018206  1.0071933  6.25926156  1.2594684
## 4       weight_boxcox 16.082134  4.8115352  0.01651671  9.4727387
## 5 acceleration_boxcox  6.845899 18.8903843 43.99632455 19.2871689
## 6         year_boxcox  4.074347 49.2783956 35.28727778  1.4576253
## 7           cylinders 15.875581  0.8889896  0.33176904  3.9967818
## 8              origin  8.245957 20.3441184  8.60824909 60.0412879
## # ... with 4 more variables: PC5 <dbl>, PC6 <dbl>, PC7 <dbl>, PC8 <dbl>

3.4 Contribution of variables to variance explained by principle component

pca_ls$pca$contrib_abs_perc

## # A tibble: 64 x 3
##              row_names    PC     value
##                  <chr> <chr>     <dbl>
##  1          mpg_boxcox   PC1 6.1560683
##  2 displacement_boxcox   PC1 6.5900371
##  3   horsepower_boxcox   PC1 6.2137229
##  4       weight_boxcox   PC1 6.2385215
##  5 acceleration_boxcox   PC1 2.6556357
##  6         year_boxcox   PC1 1.5805054
##  7           cylinders   PC1 6.1583961
##  8              origin   PC1 3.1987410
##  9          mpg_boxcox   PC2 0.1642249
## 10 displacement_boxcox   PC2 0.6267836
## # ... with 54 more rows

4 Plot PCA

4.1 Plot variance explained

Note returns a plotly graph by default

oetteR::f_pca_plot_variance_explained

f_pca_plot_variance_explained(pca_ls
                              # dont include componetns that explain less than 2.5 percent of the variabnce
                              , threshold_vae_for_pc_perc = 2.5)

4.2 Plot first two principle components and color by cylinder

and color cylinders, oetteR::f_pca_plot_variance_explained returns a taglist created by htmltools::tagList which stores a plotly graph and a DT:datatable. The algebraic sign (+/-) of the rotation value in the table tells you whether a contribution of one variable to one principle component is positive or negative.

taglist = f_pca_plot_components( pca_ls
                                 , x_axis = 'PC1'
                                 , y_axis = 'PC2'
                                 , group = 'cylinders' )

taglist

Data Preparation and Principle Component Analysis

OEB

March 1, 2018