Assignment #9: I - Data Science II

## Data Engineering and Mining II                                                                       
## Name: Paul Brown
## Fall 2022
## Assignment 9 – Part 1
Directions:  Complete the following exercises.

1.  Fill-in-the-blank: Using the _______________ resampling method, you can get an estimate for how accurate each model may be on unseen data.
    •   k-fold cross-validation

2.  The Pima Indians diabetes dataset comes from which package in R?
    •   mlbench

3.  The function, data( ), does what to the Pima Indians dataset?
    •   It loads the Pima Indians dataset into the R session (for example, in RStudio).

4.  The evaluation metrics, accuracy and kappa, are used for what task?
    •   To determine how well the model is performing
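
As a rough illustration of what these metrics measure, the sketch below computes plain accuracy and Cohen's kappa by hand in Python. It is not how caret computes them internally; the function names and the sample labels are made up for this example. Kappa rescales accuracy by the agreement expected from chance alone, which makes it more informative when one class dominates.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def cohen_kappa(actual, predicted):
    """Agreement beyond chance: (observed - expected) / (1 - expected)."""
    n = len(actual)
    observed = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(
        (actual_counts[label] / n) * (predicted_counts[label] / n)
        for label in actual_counts
    )
    # Note: undefined (division by zero) if expected agreement is exactly 1.
    return (observed - expected) / (1 - expected)

# Hypothetical labels, for illustration only.
actual = ["pos", "pos", "neg", "neg", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg", "neg", "neg"]

print(round(accuracy(actual, predicted), 4))
print(round(cohen_kappa(actual, predicted), 4))
```

Here the accuracy is 5/6, but kappa is only 4/7, because always guessing "neg" would already agree with most of the labels.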

5.  List the five algorithms that we are comparing in this section.

    •   CART – Classification and Regression Trees
    •   LDA – Linear Discriminant Analysis
    •   SVM – Support Vector Machine with Radial Basis Function
    •   KNN – k-Nearest Neighbors
    •   RF – Random Forest

6.  The resamples() function checks two things. What are they?

    •   that the models are comparable, and  
    •   that they used the same training scheme

7.  TRUE or FALSE: The trainControl configuration is a training scheme.
    •   TRUE

8.  Explain the argument, “trControl = trainControl.”

    •   It specifies the resampling scheme, that is, how cross-validation should be performed to find the best values of the tuning parameters …

9.  What are two resampling methods used to estimate the accuracy of models?
    •   data split and k-fold cross-validation

10. What is the summary ( ) function used for?
    •   To summarize the models' resample statistics

11. List at least three forms of data visualization.
    •   Box and Whisker Plots 
    •   Density Plots 
    •   Dot Plots 

12. Explain the argument, “pch = "|"”.
    •   pch is a standard graphical argument in R that sets the plotting character used by many plotting functions. Here each point is drawn as a vertical bar.

13. What are the three tasks of the test harness?
   •    The resampling method to split up the dataset
   •    The machine learning algorithm to evaluate
   •    The performance measure metric by which to evaluate predictions

14. In this homework assignment, why do we design a test harness?
   •    to evaluate different machine learning algorithms.

15. In this homework, why are we using the 10-fold cross-validation?
   •    To estimate accuracy

16. TRUE or FALSE: When we use 10-fold cross-validation, the dataset is split into 10 parts; the algorithm trains on 9 parts and tests on 1 part, and this repeats for all combinations of train-test splits.
   •    True
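
The procedure in question 16 can be sketched in a few lines of Python. This is a simplified illustration, not caret's implementation: the helper names are hypothetical, the rows are not shuffled, and the "model" is only a majority-class baseline that predicts the most common training label.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k folds of near-equal size."""
    folds = [[] for _ in range(k)]
    for i in range(n_rows):
        folds[i % k].append(i)
    return folds

def cross_validate(labels, k):
    """Mean accuracy of a majority-class baseline over k folds."""
    folds = k_fold_indices(len(labels), k)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        # Train on the other k-1 folds, test on the held-out fold.
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        test = [labels[i] for i in test_idx]
        # "Training" here just memorizes the most common training label.
        majority = max(set(train), key=train.count)
        correct = sum(label == majority for label in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

# Hypothetical labels: 14 negative and 6 positive cases.
labels = ["neg"] * 14 + ["pos"] * 6
folds = k_fold_indices(len(labels), 10)
mean_accuracy = cross_validate(labels, k=10)
print(round(mean_accuracy, 2))
```

Every row lands in exactly one test fold, so each observation is used for testing once and for training k-1 times.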

17. TRUE or FALSE: The error rate equals 1 + accuracy.
   •    False. The error rate equals 1 - accuracy.

18. TRUE or FALSE: Accuracy may be used to evaluate models.
   •    True
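
A small Python check of the relationship between accuracy and error rate from questions 17 and 18. The labels are invented for illustration, and the error rate is counted directly from the misclassifications so that the identity is verified rather than assumed.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that are correct."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def error_rate(actual, predicted):
    """Fraction of predictions that are wrong, counted directly."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical labels, for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy(actual, predicted)
err = error_rate(actual, predicted)
print(acc, err)  # the two values sum to 1
```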

19. Fill-in-the-blank: The __________  ___________  is a simple resampling method that can be used to evaluate a machine learning algorithm.
   •    Train/test split
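
A minimal Python sketch of a single data split, assuming an 80/20 ratio and a fixed random seed. Both values, and the function name, are choices made for this example rather than anything prescribed by the assignment.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=7):
    """Shuffle a copy of the rows, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical dataset: 100 row identifiers.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is fit on `train` only, and its accuracy on `test` serves as the estimate for unseen data.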

20. When creating a dataset, why do we use resampling methods?
   •    To estimate the accuracy of a model

21. When creating a validation dataset, why do we use statistical methods?
   •    To estimate the accuracy of the models that we train on part of the dataset.

22. List two processes of the resampling methods.
    •   Splitting the dataset into two sub-datasets (training and test), or splitting it into k folds for k-fold cross-validation.

23. For the test harness, after we create sub-datasets and, once we implement the model, we go on to do what?
    •   Use metrics to evaluate the performance of the model.

24. Fill-in-the-blank: When evaluating some algorithms, we may wish to create some models of the data and estimate their _________________________________ .
    •   accuracy on unseen data

25. In this section, what are the five algorithms that we are testing?
   •    Linear Discriminant Analysis (LDA).
   •    Classification and Regression Trees (CART).
   •    k-Nearest Neighbors (KNN).
   •    Support Vector Machines (SVM) with a radial kernel.
   •    Random Forest (RF).

26. In R, the _________ ___________  does support the configuration and tuning of the configuration of each model.
    •   caret package

27. According to Jason Brownlee (the author of our text), what can we do to compare the spread and the mean accuracy of each model in our program?
   •    We can create a plot of the model evaluation results, such as a box and whisker plot.

28. In R, what performance metrics do we use for regression problems?

   •    RMSE and R² (R-squared)
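
For reference, RMSE can be computed by hand as in the Python sketch below; R², another common regression metric, is included for comparison. The R² here uses the coefficient-of-determination formula (1 minus residual over total sum of squares), which is an assumption of this example; a library may define it differently, for instance as a squared correlation. The sample values are invented.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted values.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

print(round(rmse(actual, predicted), 4))
print(round(r_squared(actual, predicted), 4))
```

Lower RMSE is better (it is in the units of the target variable), while R² closer to 1 is better.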

29. After running the code to calculate and summarize statistical significance, we get a table of pairwise statistical significance scores. What does the lower diagonal of the table show us?

   •    It shows the p-values for the null hypothesis that the two models' distributions are the same. A smaller p-value is stronger evidence that the models really differ.

30. What is the algorithm, random forest, used for?
   •    Random Forest is used for classification (and regression). It builds several decision trees on various subsets of a given dataset and combines them to improve predictive accuracy. Instead of relying on a single decision tree, the random forest collects the prediction from each tree and predicts the final output by majority vote.
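
The voting step described above can be sketched in Python as follows. This shows only how per-tree predictions are combined, not how the trees are grown; the votes are hypothetical and the function name is made up for this example.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for three rows of data.
votes_per_row = [
    ["pos", "neg", "pos", "pos", "neg"],
    ["neg", "neg", "neg", "pos", "neg"],
    ["pos", "pos", "neg", "pos", "pos"],
]

forest_predictions = [majority_vote(votes) for votes in votes_per_row]
print(forest_predictions)
```

Using an odd number of trees, as here, avoids tied votes in a two-class problem.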

31. TRUE or FALSE: After comparing the algorithms, when we make predictions of the best model, we will first train the best model, then we will use the model on the testing set, never the validation set in a train-validation-test set setup.
   •    False

Paul Brown

2/10/2023