Read in the train.csv data.
Split the data into a training set and a testing set as two named objects. Produce the class type for the initial split object and the training and test sets.
## [1] "data" "in_id" "out_id" "id"
## [1] "rsplit" "mc_split"
## [1] "data.frame"
## [1] "data.frame"
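A minimal sketch of this step with {rsample}, assuming a small made-up data frame (toy) in place of train.csv, which is not reproduced here. The object names are illustrative only, and the class vector printed for the split object can differ from the output above depending on the installed {rsample} version.

```r
library(rsample)

set.seed(123)  # make the random split reproducible

# Hypothetical stand-in for train.csv (assumption: the real data has an id column)
toy <- data.frame(id = 1:1000, score = rnorm(1000))

toy_split <- initial_split(toy)   # default prop = 3/4 goes to training
toy_train <- training(toy_split)  # training set as its own named object
toy_test  <- testing(toy_split)   # test set as its own named object

names(toy_split)  # components stored in the split object
class(toy_split)  # class of the initial split object
class(toy_train)  # class of the training set
class(toy_test)   # class of the test set
```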
Find the proportion of the train.csv data that went to each of the training and test sets.
## <Analysis/Assess/Total>
## <28414/9471/37885>
## [1] 0.7500066
## [1] 0.2499934
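The two proportions are the row counts from the printed split divided by the total: 28414 / 37885 ≈ 0.7500066 and 9471 / 37885 ≈ 0.2499934. A sketch of the same calculation, again on a made-up toy data frame rather than the real file:

```r
library(rsample)

set.seed(123)

# Hypothetical stand-in for train.csv
toy <- data.frame(id = 1:1000, score = rnorm(1000))
toy_split <- initial_split(toy)

# Share of all rows that ended up in each set
train_p <- nrow(training(toy_split)) / nrow(toy)
test_p  <- nrow(testing(toy_split)) / nrow(toy)

train_p
test_p
```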
Use 10-fold cross-validation to resample the training data.
## # 10-fold cross-validation
## # A tibble: 10 x 2
## splits id
## <list> <chr>
## 1 <split [25.6K/2.8K]> Fold01
## 2 <split [25.6K/2.8K]> Fold02
## 3 <split [25.6K/2.8K]> Fold03
## 4 <split [25.6K/2.8K]> Fold04
## 5 <split [25.6K/2.8K]> Fold05
## 6 <split [25.6K/2.8K]> Fold06
## 7 <split [25.6K/2.8K]> Fold07
## 8 <split [25.6K/2.8K]> Fold08
## 9 <split [25.6K/2.8K]> Fold09
## 10 <split [25.6K/2.8K]> Fold10
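A hedged sketch of the resampling call, assuming a made-up training set (toy_train) in place of the real one. With v = 10, each fold holds out about one tenth of the rows for assessment and keeps the rest for analysis, which is why every split above reads roughly 25.6K/2.8K.

```r
library(rsample)

set.seed(123)

# Hypothetical stand-in for the training set created earlier
toy_train <- data.frame(id = 1:900, score = rnorm(900))

# 10-fold cross-validation: one row per fold, with a list-column of splits
toy_folds <- vfold_cv(toy_train, v = 10)
toy_folds
```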
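Use {purrr} to add the following columns to your k-fold CV object: the number of rows in the analysis and assessment sets for each fold (analysis_n, assessment_n); the proportion of rows in the analysis and assessment sets for each fold (analysis_p, assessment_p); the proportion of … (sp_ed_fg) in the analysis and assessment sets for each fold
## # 10-fold cross-validation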
## # A tibble: 10 x 8
## splits id analysis_n assessment_n analysis_p assessment_p sped_assessment…
## <list> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <spli… Fold… 25572 2842 0.900 0.100 0.139
## 2 <spli… Fold… 25572 2842 0.900 0.100 0.132
## 3 <spli… Fold… 25572 2842 0.900 0.100 0.129
## 4 <spli… Fold… 25572 2842 0.900 0.100 0.146
## 5 <spli… Fold… 25573 2841 0.900 0.100 0.134
## 6 <spli… Fold… 25573 2841 0.900 0.100 0.133
## 7 <spli… Fold… 25573 2841 0.900 0.100 0.128
## 8 <spli… Fold… 25573 2841 0.900 0.100 0.141
## 9 <spli… Fold… 25573 2841 0.900 0.100 0.133
## 10 <spli… Fold… 25573 2841 0.900 0.100 0.146
## # … with 1 more variable: sped_analysis_p <dbl>
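One way such columns could be built is to map over the splits list-column with purrr::map_dbl(). This is a sketch under stated assumptions: the data are made up, the "Y"/"N" coding of sp_ed_fg is hypothetical (the real coding is not shown in this section), and sped_assessment_p is an illustrative name for the column that the output above truncates.

```r
library(rsample)
library(purrr)
library(dplyr)

set.seed(123)

# Hypothetical training data; the "Y"/"N" levels of sp_ed_fg are an assumption
toy_train <- data.frame(
  id = 1:1000,
  sp_ed_fg = sample(c("Y", "N"), 1000, replace = TRUE, prob = c(0.15, 0.85))
)
toy_folds <- vfold_cv(toy_train, v = 10)

toy_folds <- mutate(
  toy_folds,
  # Row counts of the analysis and assessment sets in each fold
  analysis_n   = map_dbl(splits, ~ nrow(analysis(.x))),
  assessment_n = map_dbl(splits, ~ nrow(assessment(.x))),
  # Share of the resampled data in each set
  analysis_p   = analysis_n / (analysis_n + assessment_n),
  assessment_p = assessment_n / (analysis_n + assessment_n),
  # Share of rows flagged "Y" (assumed level) within each set
  sped_assessment_p = map_dbl(splits, ~ mean(assessment(.x)$sp_ed_fg == "Y")),
  sped_analysis_p   = map_dbl(splits, ~ mean(analysis(.x)$sp_ed_fg == "Y"))
)
toy_folds
```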
Check for duplicate values in the id columns of the assessment data between Fold01 & Fold02, and Fold09 & Fold10 (of your 10-fold cross-validation object).
## [1] id dupe_count
## <0 rows> (or 0-length row.names)
## [1] id dupe_count
## <0 rows> (or 0-length row.names)
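The empty results are expected: in k-fold cross-validation every row is assigned to exactly one assessment set, so two folds cannot share an id. The sketch below checks this with base R intersect() on made-up data; it is a swapped-in approach, not necessarily the code that produced the dupe_count output above, and shared_ids() is a hypothetical helper.

```r
library(rsample)

set.seed(123)

# Hypothetical stand-in for the training set
toy_train <- data.frame(id = 1:900, score = rnorm(900))
toy_folds <- vfold_cv(toy_train, v = 10)

# Hypothetical helper: ids present in the assessment sets of both resamples
shared_ids <- function(rset, i, j) {
  intersect(
    assessment(rset$splits[[i]])$id,
    assessment(rset$splits[[j]])$id
  )
}

n_shared_1_2  <- length(shared_ids(toy_folds, 1, 2))   # Fold01 vs Fold02
n_shared_9_10 <- length(shared_ids(toy_folds, 9, 10))  # Fold09 vs Fold10

n_shared_1_2
n_shared_9_10
```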
For the following code vfold_cv(fictional_train, v = 20):
## # Monte Carlo cross-validation (0.3/0.7) with 20 resamples
## # A tibble: 20 x 2
## splits id
## <list> <chr>
## 1 <split [11.4K/26.5K]> Resample01
## 2 <split [11.4K/26.5K]> Resample02
## 3 <split [11.4K/26.5K]> Resample03
## 4 <split [11.4K/26.5K]> Resample04
## 5 <split [11.4K/26.5K]> Resample05
## 6 <split [11.4K/26.5K]> Resample06
## 7 <split [11.4K/26.5K]> Resample07
## 8 <split [11.4K/26.5K]> Resample08
## 9 <split [11.4K/26.5K]> Resample09
## 10 <split [11.4K/26.5K]> Resample10
## 11 <split [11.4K/26.5K]> Resample11
## 12 <split [11.4K/26.5K]> Resample12
## 13 <split [11.4K/26.5K]> Resample13
## 14 <split [11.4K/26.5K]> Resample14
## 15 <split [11.4K/26.5K]> Resample15
## 16 <split [11.4K/26.5K]> Resample16
## 17 <split [11.4K/26.5K]> Resample17
## 18 <split [11.4K/26.5K]> Resample18
## 19 <split [11.4K/26.5K]> Resample19
## 20 <split [11.4K/26.5K]> Resample20
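The header of this output, "(0.3/0.7) with 20 resamples", is what rsample::mc_cv() prints when called with prop = 0.3 and times = 20. Note that the split sizes (11.4K + 26.5K ≈ 37.9K) appear to match the full 37,885-row data rather than the 28,414-row training set. A sketch on made-up data:

```r
library(rsample)

set.seed(123)

# Hypothetical stand-in for the data being resampled
toy <- data.frame(id = 1:1000, score = rnorm(1000))

# Monte Carlo CV: 20 independent random splits, 30% analysis / 70% assessment
toy_mc <- mc_cv(toy, prop = 0.3, times = 20)
toy_mc

# Sizes of the two sets in the first resample
n_analysis   <- nrow(analysis(toy_mc$splits[[1]]))
n_assessment <- nrow(assessment(toy_mc$splits[[1]]))
n_analysis
n_assessment
```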
Count the values shared by the id columns of the assessment data between Resample 8 & Resample 12, and Resample 2 & Resample 20 in your MC CV object.
## n
## 1 18564
## n
## 1 18614
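Unlike k-fold folds, Monte Carlo resamples are drawn independently, so their assessment sets overlap. With 70% of rows held out each time, a given id falls in both assessment sets with probability of roughly 0.7 × 0.7 = 0.49, and 0.49 × 37885 ≈ 18564, which is in line with the two counts above. A sketch of the count on made-up data, using a hypothetical helper shared_n():

```r
library(rsample)

set.seed(123)

# Hypothetical stand-in for the data being resampled
toy <- data.frame(id = 1:1000, score = rnorm(1000))
toy_mc <- mc_cv(toy, prop = 0.3, times = 20)

# Hypothetical helper: number of ids found in both assessment sets
shared_n <- function(rset, i, j) {
  length(intersect(
    assessment(rset$splits[[i]])$id,
    assessment(rset$splits[[j]])$id
  ))
}

n_8_12 <- shared_n(toy_mc, 8, 12)  # Resample08 vs Resample12
n_2_20 <- shared_n(toy_mc, 2, 20)  # Resample02 vs Resample20

n_8_12
n_2_20
```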