suppressPackageStartupMessages( require(oetteR) )
suppressPackageStartupMessages( require(tidyverse) )
Before I learned about recipes. I had my own set of functions that would do something similar with less functionality of course. oetteR::f_clean_data() takes a dataframe and performs some automated cleaning steps and sorts the variables into categrocial and numerical categories and returns a list which I always name data_ls.
data = ISLR::Auto
data_ls = f_clean_data( data = data
# reduce number of levels to 10, group to other
, max_number_of_levels_factors = 10
# numericals with less than 10 unique values will be converted to factors
, min_number_of_levels_nums = 10
# exclude missing data
, exclude_missing = T
# negative values will be set to zero
,replace_neg_values_with_zero = T
# allow negative values in these columns
,allow_neg_values = 'null'
# tag id columns
, id_cols = 'name'
)
## [1] "Number of excluded observations: 0"
print( str(data_ls) )
## List of 6
## $ data :Classes 'tbl_df', 'tbl' and 'data.frame': 392 obs. of 9 variables:
## ..$ mpg : num [1:392] 18 15 18 16 17 15 14 14 14 15 ...
## ..$ cylinders : Ord.factor w/ 5 levels "3"<"4"<"5"<"6"<..: 5 5 5 5 5 5 5 5 5 5 ...
## ..$ displacement: num [1:392] 307 350 318 304 302 429 454 440 455 390 ...
## ..$ horsepower : num [1:392] 130 165 150 150 140 198 220 215 225 190 ...
## ..$ weight : num [1:392] 3504 3693 3436 3433 3449 ...
## ..$ acceleration: num [1:392] 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## ..$ year : num [1:392] 70 70 70 70 70 70 70 70 70 70 ...
## ..$ origin : Ord.factor w/ 3 levels "1"<"2"<"3": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## $ numericals : chr [1:6] "mpg" "displacement" "horsepower" "weight" ...
## $ categoricals : chr [1:2] "cylinders" "origin"
## $ categoricals_ordered: chr [1:2] "cylinders" "origin"
## $ all_variables : chr [1:8] "mpg" "cylinders" "displacement" "horsepower" ...
## $ ids : chr "name"
## NULL
Notice that numericals are converted to ordered factors
We can use a boxcox transformation on numerical variables.
data_ls = f_boxcox(data_ls)
print( str(data_ls) )
## List of 8
## $ data :Classes 'tbl_df', 'tbl' and 'data.frame': 392 obs. of 9 variables:
## ..$ mpg : num [1:392] 18 15 18 16 17 15 14 14 14 15 ...
## ..$ cylinders : Ord.factor w/ 5 levels "3"<"4"<"5"<"6"<..: 5 5 5 5 5 5 5 5 5 5 ...
## ..$ displacement: num [1:392] 307 350 318 304 302 429 454 440 455 390 ...
## ..$ horsepower : num [1:392] 130 165 150 150 140 198 220 215 225 190 ...
## ..$ weight : num [1:392] 3504 3693 3436 3433 3449 ...
## ..$ acceleration: num [1:392] 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## ..$ year : num [1:392] 70 70 70 70 70 70 70 70 70 70 ...
## ..$ origin : Ord.factor w/ 3 levels "1"<"2"<"3": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## $ numericals : chr [1:6] "mpg" "displacement" "horsepower" "weight" ...
## $ categoricals : chr [1:2] "cylinders" "origin"
## $ categoricals_ordered: chr [1:2] "cylinders" "origin"
## $ all_variables : chr [1:8] "mpg" "cylinders" "displacement" "horsepower" ...
## $ ids : chr "name"
## $ boxcox_data :'data.frame': 392 obs. of 6 variables:
## ..$ mpg_boxcox : num [1:392] 3.91 3.59 3.91 3.71 3.81 ...
## ..$ displacement_boxcox: num [1:392] 2.74 2.76 2.74 2.73 2.73 ...
## ..$ horsepower_boxcox : num [1:392] 1.82 1.84 1.84 1.84 1.83 ...
## ..$ weight_boxcox : num [1:392] 3.05 3.05 3.04 3.04 3.04 ...
## ..$ acceleration_boxcox: num [1:392] 5.74 5.55 5.36 5.74 5.17 ...
## ..$ year_boxcox : num [1:392] 14.7 14.7 14.7 14.7 14.7 ...
## $ boxcox_names : chr [1:6] "mpg_boxcox" "displacement_boxcox" "horsepower_boxcox" "weight_boxcox" ...
## NULL
Base R has a decent function for PCA prcomp(). However the returned object is a bit messy and contains a lot of matrices intesad of dataframes. oetteR::f_pca() is a convenience wrapper which uses data_ls lists.
pca_ls = f_pca( data_ls
, center = T
, scale = T
, use_boxcox_tansformed_vars = T
, include_ordered_categoricals = T
# drop components that explain less variance than the threshold
, threshold_vae_for_pc_perc = 0 )
The returned pca object has some new features compared to prcomp()
as_tibble( pca_ls$data )
## # A tibble: 392 x 17
## mpg cylinders displacement horsepower weight acceleration year
## <dbl> <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 18 8 307 130 3504 12.0 70
## 2 15 8 350 165 3693 11.5 70
## 3 18 8 318 150 3436 11.0 70
## 4 16 8 304 150 3433 12.0 70
## 5 17 8 302 140 3449 10.5 70
## 6 15 8 429 198 4341 10.0 70
## 7 14 8 454 220 4354 9.0 70
## 8 14 8 440 215 4312 8.5 70
## 9 14 8 455 225 4425 10.0 70
## 10 15 8 390 190 3850 8.5 70
## # ... with 382 more rows, and 10 more variables: origin <ord>,
## # name <fctr>, PC1 <dbl>, PC2 <dbl>, PC3 <dbl>, PC4 <dbl>, PC5 <dbl>,
## # PC6 <dbl>, PC7 <dbl>, PC8 <dbl>
pca_ls$pca$vae
## # A tibble: 1 x 9
## row_names PC1 PC2 PC3 PC4 PC5 PC6 PC7
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 vae 38.79163 16.55043 14.77422 11.44624 7.038674 4.68511 3.811565
## # ... with 1 more variables: PC8 <dbl>
Note the columns add up to 100
pca_ls$pca$contrib
## # A tibble: 8 x 9
## row_names PC1 PC2 PC3 PC4
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 mpg_boxcox 15.869580 0.9922695 5.47987229 3.6485975
## 2 displacement_boxcox 16.988297 3.7871142 0.02072897 0.8363316
## 3 horsepower_boxcox 16.018206 1.0071933 6.25926156 1.2594684
## 4 weight_boxcox 16.082134 4.8115352 0.01651671 9.4727387
## 5 acceleration_boxcox 6.845899 18.8903843 43.99632455 19.2871689
## 6 year_boxcox 4.074347 49.2783956 35.28727778 1.4576253
## 7 cylinders 15.875581 0.8889896 0.33176904 3.9967818
## 8 origin 8.245957 20.3441184 8.60824909 60.0412879
## # ... with 4 more variables: PC5 <dbl>, PC6 <dbl>, PC7 <dbl>, PC8 <dbl>
pca_ls$pca$contrib_abs_perc
## # A tibble: 64 x 3
## row_names PC value
## <chr> <chr> <dbl>
## 1 mpg_boxcox PC1 6.1560683
## 2 displacement_boxcox PC1 6.5900371
## 3 horsepower_boxcox PC1 6.2137229
## 4 weight_boxcox PC1 6.2385215
## 5 acceleration_boxcox PC1 2.6556357
## 6 year_boxcox PC1 1.5805054
## 7 cylinders PC1 6.1583961
## 8 origin PC1 3.1987410
## 9 mpg_boxcox PC2 0.1642249
## 10 displacement_boxcox PC2 0.6267836
## # ... with 54 more rows
Note returns a plotly graph by default
oetteR::f_pca_plot_variance_explained
f_pca_plot_variance_explained(pca_ls
# dont include componetns that explain less than 2.5 percent of the variabnce
, threshold_vae_for_pc_perc = 2.5)
and color cylinders, oetteR::f_pca_plot_variance_explained returns a taglist created by htmltools::tagList which stores a plotly graph and a DT:datatable. The algebraic sign (+/-) of the rotation value in the table tells you whether a contribution of one variable to one principle component is positive or negative.
taglist = f_pca_plot_components( pca_ls
, x_axis = 'PC1'
, y_axis = 'PC2'
, group = 'cylinders' )
taglist