Lab 1

Read in the `train.csv` data.

1. Initial Split

Split the data into a training set and a testing set as two named objects. Produce the class type for the initial split object and the training and test sets.

## [1] "data"   "in_id"  "out_id" "id"

## [1] "rsplit"   "mc_split"

## [1] "data.frame"

## [1] "data.frame"
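A minimal sketch of one way to produce output like the above with `{rsample}`. The data frame `d` is a simulated stand-in (hypothetical columns `id` and `score`), not the real `train.csv`. The exact class vector printed for the split object may differ across `{rsample}` versions, and data read with `readr::read_csv()` would show tibble classes rather than a plain `"data.frame"`.

```r
library(rsample)

# Hypothetical stand-in for train.csv; with the real file, something like
# d <- read.csv("train.csv") would take its place.
set.seed(3000)
d <- data.frame(id = 1:1000, score = rnorm(1000))

# Initial split (default prop = 3/4 goes to training)
split <- initial_split(d)

# Two named objects: the training set and the testing set
train <- training(split)
test  <- testing(split)

names(split)   # elements stored in the split object
class(split)   # class of the initial split object
class(train)   # class of the training set
class(test)    # class of the test set
```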

2. Use code to show the proportion of the `train.csv` data that went to each of the training and test sets.

## <Analysis/Assess/Total>
## <28414/9471/37885>

## [1] 0.7500066

## [1] 0.2499934
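The proportions can be computed by dividing each set's row count by the row count of the full data. The sketch below does this on a simulated stand-in (`d` is hypothetical) and then repeats the arithmetic with the counts printed above.

```r
library(rsample)

# Hypothetical stand-in for train.csv
set.seed(3000)
d <- data.frame(id = 1:1000)

split <- initial_split(d)

# Proportion of rows that went to each set
p_train <- nrow(training(split)) / nrow(d)
p_test  <- nrow(testing(split)) / nrow(d)
p_train
p_test

# The same arithmetic using the counts from the printed split above
28414 / 37885   # training proportion, about 0.75
9471 / 37885    # test proportion, about 0.25
```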

3. k-fold cross-validation

Use 10-fold cross-validation to resample the training data.

## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits               id    
##    <list>               <chr> 
##  1 <split [25.6K/2.8K]> Fold01
##  2 <split [25.6K/2.8K]> Fold02
##  3 <split [25.6K/2.8K]> Fold03
##  4 <split [25.6K/2.8K]> Fold04
##  5 <split [25.6K/2.8K]> Fold05
##  6 <split [25.6K/2.8K]> Fold06
##  7 <split [25.6K/2.8K]> Fold07
##  8 <split [25.6K/2.8K]> Fold08
##  9 <split [25.6K/2.8K]> Fold09
## 10 <split [25.6K/2.8K]> Fold10
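A sketch of the resampling call, assuming the training set is stored in an object named `train` (simulated here, so the fold sizes will not match the output above).

```r
library(rsample)

# Hypothetical stand-in for the training set
set.seed(3000)
train <- data.frame(id = 1:1000, score = rnorm(1000))

# 10-fold cross-validation: each fold holds out about 1/10 of the rows
cv <- vfold_cv(train, v = 10)
cv
```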

4. Use `{purrr}` to add the following columns to your k-fold CV object:

- `analysis_n` = the n of the analysis set for each fold
- `assessment_n` = the n of the assessment set for each fold
- `analysis_p` = the proportion of the analysis set for each fold
- `assessment_p` = the proportion of the assessment set for each fold
- `sped_p` = the proportion of students receiving special education services (`sp_ed_fg`) in the analysis and assessment sets for each fold

## #  10-fold cross-validation 
## # A tibble: 10 x 8
##    splits id    analysis_n assessment_n analysis_p assessment_p sped_assessment…
##    <list> <chr>      <dbl>        <dbl>      <dbl>        <dbl>            <dbl>
##  1 <spli… Fold…      25572         2842      0.900        0.100            0.139
##  2 <spli… Fold…      25572         2842      0.900        0.100            0.132
##  3 <spli… Fold…      25572         2842      0.900        0.100            0.129
##  4 <spli… Fold…      25572         2842      0.900        0.100            0.146
##  5 <spli… Fold…      25573         2841      0.900        0.100            0.134
##  6 <spli… Fold…      25573         2841      0.900        0.100            0.133
##  7 <spli… Fold…      25573         2841      0.900        0.100            0.128
##  8 <spli… Fold…      25573         2841      0.900        0.100            0.141
##  9 <spli… Fold…      25573         2841      0.900        0.100            0.133
## 10 <spli… Fold…      25573         2841      0.900        0.100            0.146
## # … with 1 more variable: sped_analysis_p <dbl>
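One possible approach, sketched on simulated data: map over the `splits` column with `purrr::map_dbl()` and pull each set out with `analysis()` and `assessment()`. The coding of `sp_ed_fg` as `"Y"`/`"N"` is an assumption here, as is the object name `cv_summary`.

```r
library(rsample)
library(dplyr)
library(purrr)

# Hypothetical stand-in for the training set; the "Y"/"N" coding of
# sp_ed_fg is assumed for illustration.
set.seed(3000)
train <- data.frame(
  id = 1:1000,
  sp_ed_fg = sample(c("Y", "N"), 1000, replace = TRUE, prob = c(0.14, 0.86))
)

cv <- vfold_cv(train, v = 10)

cv_summary <- cv %>%
  mutate(
    # row counts of each set, per fold
    analysis_n   = map_dbl(splits, ~ nrow(analysis(.x))),
    assessment_n = map_dbl(splits, ~ nrow(assessment(.x))),
    # share of the resampled data in each set
    analysis_p   = analysis_n / (analysis_n + assessment_n),
    assessment_p = assessment_n / (analysis_n + assessment_n),
    # share of students flagged for special education in each set
    sped_analysis_p   = map_dbl(splits, ~ mean(analysis(.x)$sp_ed_fg == "Y")),
    sped_assessment_p = map_dbl(splits, ~ mean(assessment(.x)$sp_ed_fg == "Y"))
  )

cv_summary
```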

5. Please demonstrate that there are no common values in the `id` columns of the `assessment` data between `Fold01` & `Fold02`, and `Fold09` & `Fold10` (of your 10-fold cross-validation object).

## [1] id         dupe_count
## <0 rows> (or 0-length row.names)

## [1] id         dupe_count
## <0 rows> (or 0-length row.names)
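The output above looks like the result of a duplicate check that returned zero rows. A simpler alternative, sketched here with base R's `intersect()` on simulated data, assumes `id` uniquely identifies each row.

```r
library(rsample)

# Hypothetical stand-in for the training set; id is assumed unique per row
set.seed(3000)
train <- data.frame(id = 1:1000)

cv <- vfold_cv(train, v = 10)

# Collect the assessment-set ids for every fold, named by fold id
ids <- lapply(cv$splits, function(s) assessment(s)$id)
names(ids) <- cv$id

# In k-fold CV every row is held out exactly once, so both are empty
common_01_02 <- intersect(ids[["Fold01"]], ids[["Fold02"]])
common_09_10 <- intersect(ids[["Fold09"]], ids[["Fold10"]])
length(common_01_02)
length(common_09_10)
```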

6. Try to answer these next questions without running similar code on real data.

For the code `vfold_cv(fictional_train, v = 20)`:

What is the proportion in the analysis set for each fold? 95%
What is the proportion in the assessment set for each fold? 5%
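The reasoning can be checked with a line of arithmetic: with `v` folds, each fold is held out once as the assessment set (1/v of the rows) and the remaining v - 1 folds form the analysis set.

```r
v <- 20

# One fold out of v is held out for assessment
assessment_p <- 1 / v
# The other v - 1 folds make up the analysis set
analysis_p <- (v - 1) / v

analysis_p     # 0.95
assessment_p   # 0.05
```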

7. Use Monte Carlo CV to resample the training data with 20 resamples and .30 of each resample reserved for the assessment sets.

## # Monte Carlo cross-validation (0.3/0.7) with 20 resamples  
## # A tibble: 20 x 2
##    splits                id        
##    <list>                <chr>     
##  1 <split [11.4K/26.5K]> Resample01
##  2 <split [11.4K/26.5K]> Resample02
##  3 <split [11.4K/26.5K]> Resample03
##  4 <split [11.4K/26.5K]> Resample04
##  5 <split [11.4K/26.5K]> Resample05
##  6 <split [11.4K/26.5K]> Resample06
##  7 <split [11.4K/26.5K]> Resample07
##  8 <split [11.4K/26.5K]> Resample08
##  9 <split [11.4K/26.5K]> Resample09
## 10 <split [11.4K/26.5K]> Resample10
## 11 <split [11.4K/26.5K]> Resample11
## 12 <split [11.4K/26.5K]> Resample12
## 13 <split [11.4K/26.5K]> Resample13
## 14 <split [11.4K/26.5K]> Resample14
## 15 <split [11.4K/26.5K]> Resample15
## 16 <split [11.4K/26.5K]> Resample16
## 17 <split [11.4K/26.5K]> Resample17
## 18 <split [11.4K/26.5K]> Resample18
## 19 <split [11.4K/26.5K]> Resample19
## 20 <split [11.4K/26.5K]> Resample20
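A sketch of the call on simulated data. Note that in `mc_cv()` the `prop` argument is the share of rows kept for the analysis set, so reserving .30 for assessment corresponds to `prop = 0.70`. The printed header reports the proportions in analysis/assessment order.

```r
library(rsample)

# Hypothetical stand-in for the training set
set.seed(3000)
train <- data.frame(id = 1:1000)

# prop is the analysis proportion; the remaining 30% is held out
mc <- mc_cv(train, prop = 0.70, times = 20)
mc

# Size of the held-out set in the first resample
n_assess <- nrow(assessment(mc$splits[[1]]))
n_assess
```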

8. Please demonstrate that there are common values in the `id` columns of the `assessment` data between `Resample08` & `Resample12`, and `Resample02` & `Resample20` in your MC CV object.

##       n
## 1 18564

##       n
## 1 18614
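Because Monte Carlo CV draws each resample independently, the same row can land in more than one assessment set. A sketch that counts the shared ids, again on simulated data with an assumed unique `id` column:

```r
library(rsample)

# Hypothetical stand-in for the training set; id is assumed unique per row
set.seed(3000)
train <- data.frame(id = 1:1000)

mc <- mc_cv(train, prop = 0.70, times = 20)

# Assessment-set ids for every resample, named by resample id
ids <- lapply(mc$splits, function(s) assessment(s)$id)
names(ids) <- mc$id

# Number of ids shared by each pair of assessment sets
n_common_08_12 <- length(intersect(ids[["Resample08"]], ids[["Resample12"]]))
n_common_02_20 <- length(intersect(ids[["Resample02"]], ids[["Resample20"]]))
n_common_08_12
n_common_02_20
```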

9. You plan on doing bootstrap resampling with a training set with n = 500.

What is the sample size of an analysis set for a given bootstrap resample? 500
What is the sample size of an assessment set for a given bootstrap resample? It varies by resample; about 184 on average (roughly 36.8% of 500)
If each row was selected only once for an analysis set:
- what would be the size of the analysis set? 500
- and what would be the size of the assessment set? 0
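These sizes can be checked empirically and with a short calculation. A bootstrap analysis set draws n rows with replacement, so a given row is left out with probability (1 - 1/n)^n, which is close to exp(-1), or about 0.368. The sketch below uses a simulated 500-row data frame and `rsample::bootstraps()`.

```r
library(rsample)

# Hypothetical training set with n = 500
set.seed(3000)
train <- data.frame(id = 1:500)

boots <- bootstraps(train, times = 25)

# Analysis sets are drawn with replacement and always have n rows
analysis_n <- sapply(boots$splits, function(s) nrow(analysis(s)))
# Assessment sets hold the rows never drawn, so their size varies
assessment_n <- sapply(boots$splits, function(s) nrow(assessment(s)))

unique(analysis_n)
range(assessment_n)

# Expected number of rows left out of a bootstrap sample of size n
n <- 500
expected_oob <- n * (1 - 1 / n)^n
expected_oob
```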
