Machine Learning I                                   
                                                      
 
Directions:  Complete the following exercises.

1. What is the goal of predictive modeling?
•   To create models that make good predictions on new data.

2.  What statistical methods do we use to estimate the performance of a model on new data?
•   Resampling methods

3.  Are these statistical tools used on training data or on testing data?
•   Training data. Resampling methods are applied to the training data in order to estimate how well a model will perform on new data.

4.  What is the goal of resampling methods?
•   To make the best use of your training data in order to accurately estimate the performance of a model on new, unseen data.

5.  Fill-in-the-blank: Accurate _______ _______ can be used to help you choose which set of parameters to use or which method to select.
•   estimates of performance

6.  List the two common resampling methods.
•   A train and test split of your data.
•   k-fold cross-validation

7.  When using resampling methods, what is a good default split of the dataset?
•   A split of 60% train / 40% test is a good default.

8.  Fill-in-the-blank: _________  _________ are selected and removed from the copied dataset and added to the train dataset until the train dataset contains the target number of rows.
•   Random rows
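
A minimal sketch of how such a split can be implemented, following the row-by-row description above (the function name and the 60/40 default mirror this section's questions, not a particular library API):

    from random import randrange

    def train_test_split(dataset, split=0.60):
        # Work on a copy so the original dataset is left untouched
        train = list()
        train_size = split * len(dataset)
        dataset_copy = list(dataset)
        while len(train) < train_size:
            # Select a random row, remove it from the copy, move it to train
            index = randrange(len(dataset_copy))
            train.append(dataset_copy.pop(index))
        # The rows left over in the copy become the test set
        return train, dataset_copy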

9.  What is the randrange() function used for?
•   It is used to generate a random integer between 0 and the size of the list.
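
For example (the list of rows here is made up for illustration):

    from random import randrange

    rows = [[11], [22], [33], [44], [55]]
    index = randrange(len(rows))   # a random integer from 0 up to, but not including, len(rows)
    print(index, rows[index])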

10. Why is the random seed fixed before splitting the training dataset?
•   It ensures that the exact same split of the data is made every time the code is executed.
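
A small demonstration of this behavior (the seed value 1 is arbitrary):

    from random import seed, randrange

    seed(1)                                  # fix the random seed
    print(randrange(100), randrange(100))
    seed(1)                                  # re-seeding with the same value...
    print(randrange(100), randrange(100))    # ...reproduces the exact same numbers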

11. What is a limitation of using the training and test split method?
•   You get a noisy estimate of algorithm performance.

12. How does the k-fold cross-validation method provide a more accurate estimate of model performance?
•   Split the data into k groups, then train and evaluate the algorithm k times, each time holding out a different group for evaluation. Summarize performance by taking the mean of the k performance scores, as in the example below.
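
The final summary step is just an average (the five scores below are hypothetical):

    # Hypothetical accuracy scores from k=5 train/evaluate rounds
    scores = [0.78, 0.82, 0.80, 0.75, 0.81]
    mean_score = sum(scores) / len(scores)
    print('Mean accuracy: %.3f' % mean_score)   # prints: Mean accuracy: 0.792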

13. What is a fold?
•   It is the name given to each group of data.

14. Explain the k-fold cross-validation method.
•   First split the data into k groups; then train and evaluate the model k times, holding out a different group each time; finally, summarize performance by taking the mean performance score.

15. When using the k-fold cross-validation method, what is the relationship between k and the number of rows in your training dataset?
•   Each of the k groups contains the same number of rows: the number of rows in the training dataset divided by k.

16. What is a good default value for k when using large datasets?
•   k=10

17. What is the cross_validation_split() function used for?
•   It takes a dataset and the number of folds to use as input, splits the dataset into that many folds, and returns them so that a model can be trained and evaluated with each fold held out in turn (see the sketch below).
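
One plausible sketch of such a function, under the assumption that it simply returns the list of folds and a separate loop holds each fold out in turn:

    from random import randrange

    def cross_validation_split(dataset, folds=3):
        # Split a dataset into k folds of equal size
        dataset_split = list()
        dataset_copy = list(dataset)
        fold_size = int(len(dataset) / folds)
        for _ in range(folds):
            fold = list()
            while len(fold) < fold_size:
                # Move a random row from the copy into the current fold
                index = randrange(len(dataset_copy))
                fold.append(dataset_copy.pop(index))
            dataset_split.append(fold)
        return dataset_split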

18. How do we calculate the size of each fold?
•   It is the size of the dataset divided by the number of folds required.
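
A quick worked example (the dataset size and k are chosen for illustration):

    dataset_rows = 150              # hypothetical dataset size
    k = 10
    fold_size = dataset_rows // k   # 150 / 10 = 15 rows per fold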

19. Fill-in-the-blank: The _______  ________ gives a robust estimate of performance compared to other methods.
•   k-fold cross-validation

20. What is the downside of cross-validation?
•   It can be time-consuming to run, requiring k different models to be trained and evaluated.

21. Why is the train and test split resampling method so widely used?
•   Because it is easy to understand and implement, and because it gives a quick estimate of algorithm performance.

22. List two functions found in this section and explain what they are used for.
•   randrange() function - generates a random integer between 0 and the size of the list.
•   train_test_split() function - splits a dataset into a train set and a test set.
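
Putting the two functions sketched earlier in this section together (the ten single-column rows are made up for illustration):

    from random import seed

    seed(1)                               # reproducible results
    dataset = [[i] for i in range(10)]    # ten made-up single-column rows
    train, test = train_test_split(dataset, split=0.60)
    print(len(train), len(test))          # 6 4

    folds = cross_validation_split(dataset, folds=5)
    print([len(fold) for fold in folds])  # [2, 2, 2, 2, 2]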