Read in the train.csv data.

1. Initial Split

Split the data into a training set and a testing set, stored as two named objects. Print the class of the initial split object and of the training and test sets.

## [1] "data"   "in_id"  "out_id" "id"
## [1] "rsplit"   "mc_split"
## [1] "data.frame"
## [1] "data.frame"
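
One way this step might look is sketched below. It assumes the {rsample} package and uses a small made-up data frame `d` in place of train.csv so that it runs on its own; the `score` column and the object names `split`, `train`, and `test` are illustrative, not taken from the assignment.

```r
library(rsample)

# Stand-in for train.csv (hypothetical columns). With the real file this
# could be something like: d <- read.csv("train.csv")
set.seed(1)
d <- data.frame(id = 1:1000, score = rnorm(1000))

split <- initial_split(d)   # default prop = 3/4 goes to the training set
train <- training(split)    # analysis (training) rows
test  <- testing(split)     # held-out (test) rows

names(split)   # components stored in the split object
class(split)
class(train)
class(test)
```

The exact class vector printed for `split` can differ between {rsample} versions, so it may not match the output shown above character for character.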

2. Use code to show the proportion of the train.csv data that went to each of the training and test sets.

## <Analysis/Assess/Total>
## <28414/9471/37885>
## [1] 0.7500066
## [1] 0.2499934
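
A minimal sketch of the proportion check, again on a made-up stand-in data frame rather than the real train.csv: printing the split shows the row counts, and dividing each set's row count by the total gives the two proportions. The names `p_train` and `p_test` are illustrative.

```r
library(rsample)

set.seed(1)
d <- data.frame(id = 1:1000, score = rnorm(1000))  # stand-in for train.csv
split <- initial_split(d)

split   # prints the analysis, assessment, and total row counts

p_train <- nrow(training(split)) / nrow(d)  # share of rows in the training set
p_test  <- nrow(testing(split)) / nrow(d)   # share of rows in the test set
p_train
p_test
```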

3. k-fold cross-validation

Use 10-fold cross-validation to resample the training data.

## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits               id    
##    <list>               <chr> 
##  1 <split [25.6K/2.8K]> Fold01
##  2 <split [25.6K/2.8K]> Fold02
##  3 <split [25.6K/2.8K]> Fold03
##  4 <split [25.6K/2.8K]> Fold04
##  5 <split [25.6K/2.8K]> Fold05
##  6 <split [25.6K/2.8K]> Fold06
##  7 <split [25.6K/2.8K]> Fold07
##  8 <split [25.6K/2.8K]> Fold08
##  9 <split [25.6K/2.8K]> Fold09
## 10 <split [25.6K/2.8K]> Fold10
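
A sketch of the resampling call, assuming {rsample} and a made-up stand-in for the training set (the real `train` object would come from the initial split):

```r
library(rsample)

set.seed(1)
train <- data.frame(id = 1:900, score = rnorm(900))  # stand-in training set

# Each of the 10 folds is held out once for assessment;
# the other 9 folds form the analysis set for that resample.
cv <- vfold_cv(train, v = 10)
cv
```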

4. Use {purrr} to add the following columns to your k-fold CV object:

## #  10-fold cross-validation 
## # A tibble: 10 x 8
##    splits id    analysis_n assessment_n analysis_p assessment_p sped_assessment…
##    <list> <chr>      <dbl>        <dbl>      <dbl>        <dbl>            <dbl>
##  1 <spli… Fold…      25572         2842      0.900        0.100            0.139
##  2 <spli… Fold…      25572         2842      0.900        0.100            0.132
##  3 <spli… Fold…      25572         2842      0.900        0.100            0.129
##  4 <spli… Fold…      25572         2842      0.900        0.100            0.146
##  5 <spli… Fold…      25573         2841      0.900        0.100            0.134
##  6 <spli… Fold…      25573         2841      0.900        0.100            0.133
##  7 <spli… Fold…      25573         2841      0.900        0.100            0.128
##  8 <spli… Fold…      25573         2841      0.900        0.100            0.141
##  9 <spli… Fold…      25573         2841      0.900        0.100            0.133
## 10 <spli… Fold…      25573         2841      0.900        0.100            0.146
## # … with 1 more variable: sped_analysis_p <dbl>
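
The columns above could be built roughly as follows. This is a sketch under assumptions: the special-education indicator is called `sp_ed_fg` with values "Y"/"N" here, which is a hypothetical name, and the column truncated in the printout is named `sped_assessment_p` by analogy with `sped_analysis_p`. The data are simulated so the block runs without train.csv.

```r
library(rsample)
library(purrr)
library(dplyr)

set.seed(1)
# Stand-in training set; sp_ed_fg is a hypothetical column name
train <- data.frame(
  id = 1:900,
  sp_ed_fg = sample(c("Y", "N"), 900, replace = TRUE, prob = c(0.13, 0.87))
)
cv <- vfold_cv(train, v = 10)

cv <- cv %>%
  mutate(
    # row counts of the analysis and assessment sets in each fold
    analysis_n   = map_dbl(splits, ~ nrow(analysis(.x))),
    assessment_n = map_dbl(splits, ~ nrow(assessment(.x))),
    # the same counts as proportions of the full training set
    analysis_p   = analysis_n / (analysis_n + assessment_n),
    assessment_p = assessment_n / (analysis_n + assessment_n),
    # share of rows flagged "Y" in each set (column names are assumptions)
    sped_assessment_p = map_dbl(splits, ~ mean(assessment(.x)$sp_ed_fg == "Y")),
    sped_analysis_p   = map_dbl(splits, ~ mean(analysis(.x)$sp_ed_fg == "Y"))
  )
cv
```

`map_dbl()` is used so that each new column is a plain double vector, matching the `<dbl>` column types in the printout.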

5. Please demonstrate that there are no common values in the id columns of the assessment data between Fold01 & Fold02, and between Fold09 & Fold10 (of your 10-fold cross-validation object).

## [1] id         dupe_count
## <0 rows> (or 0-length row.names)
## [1] id         dupe_count
## <0 rows> (or 0-length row.names)
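
One possible check is sketched below. It swaps in base R's `intersect()` instead of whatever produced the output above, and runs on a simulated stand-in training set; the helper `assess_ids()` is a hypothetical convenience function, not part of {rsample}.

```r
library(rsample)

set.seed(1)
train <- data.frame(id = 1:900, score = rnorm(900))  # stand-in training set
cv <- vfold_cv(train, v = 10)

# ids of the rows held out for assessment in resample i (hypothetical helper)
assess_ids <- function(rset, i) assessment(rset$splits[[i]])$id

shared_1_2  <- intersect(assess_ids(cv, 1), assess_ids(cv, 2))
shared_9_10 <- intersect(assess_ids(cv, 9), assess_ids(cv, 10))

length(shared_1_2)    # number of ids shared by Fold01 and Fold02
length(shared_9_10)   # number of ids shared by Fold09 and Fold10
```

Both lengths should be 0, because k-fold cross-validation assigns every row to exactly one assessment fold.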

6. Try to answer these next questions without running similar code on real data.

For the following code vfold_cv(fictional_train, v = 20):

7. Use Monte Carlo CV to resample the training data with 20 resamples and .30 of each resample reserved for the assessment sets.

## # Monte Carlo cross-validation (0.3/0.7) with 20 resamples  
## # A tibble: 20 x 2
##    splits                id        
##    <list>                <chr>     
##  1 <split [11.4K/26.5K]> Resample01
##  2 <split [11.4K/26.5K]> Resample02
##  3 <split [11.4K/26.5K]> Resample03
##  4 <split [11.4K/26.5K]> Resample04
##  5 <split [11.4K/26.5K]> Resample05
##  6 <split [11.4K/26.5K]> Resample06
##  7 <split [11.4K/26.5K]> Resample07
##  8 <split [11.4K/26.5K]> Resample08
##  9 <split [11.4K/26.5K]> Resample09
## 10 <split [11.4K/26.5K]> Resample10
## 11 <split [11.4K/26.5K]> Resample11
## 12 <split [11.4K/26.5K]> Resample12
## 13 <split [11.4K/26.5K]> Resample13
## 14 <split [11.4K/26.5K]> Resample14
## 15 <split [11.4K/26.5K]> Resample15
## 16 <split [11.4K/26.5K]> Resample16
## 17 <split [11.4K/26.5K]> Resample17
## 18 <split [11.4K/26.5K]> Resample18
## 19 <split [11.4K/26.5K]> Resample19
## 20 <split [11.4K/26.5K]> Resample20
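
A sketch of the call, assuming {rsample} and a simulated stand-in training set. In `mc_cv()`, `prop` is the share of rows kept for analysis, so reserving .30 for assessment corresponds to `prop = 0.70`.

```r
library(rsample)

set.seed(1)
train <- data.frame(id = 1:1000, score = rnorm(1000))  # stand-in training set

# prop = share of rows used for analysis; the remaining 0.30 is assessment
mc <- mc_cv(train, prop = 0.70, times = 20)
mc
```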

8. Please demonstrate that there are common values in the id columns of the assessment data between Resample 8 & Resample 12, and between Resample 2 & Resample 20, in your MC CV object.

##       n
## 1 18564
##       n
## 1 18614
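
The overlap can be counted in the same spirit, shown here with base R's `intersect()` on a simulated stand-in training set. The helper `assess_ids()` and the names `n_8_12` and `n_2_20` are hypothetical.

```r
library(rsample)

set.seed(1)
train <- data.frame(id = 1:1000, score = rnorm(1000))  # stand-in training set
mc <- mc_cv(train, prop = 0.70, times = 20)

# ids of the rows held out for assessment in resample i (hypothetical helper)
assess_ids <- function(rset, i) assessment(rset$splits[[i]])$id

n_8_12 <- length(intersect(assess_ids(mc, 8), assess_ids(mc, 12)))
n_2_20 <- length(intersect(assess_ids(mc, 2), assess_ids(mc, 20)))

n_8_12   # ids shared by the assessment sets of Resample08 and Resample12
n_2_20   # ids shared by the assessment sets of Resample02 and Resample20
```

Unlike k-fold CV, each Monte Carlo resample draws its assessment set independently, so shared ids between any two resamples are expected (though not strictly guaranteed when each assessment set covers less than half of the rows).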

9. You plan on doing bootstrap resampling with a training set with n = 500.