Random Forest - Assignment 7

# Assignment 7 – Random Forest Model – Part 2
 
## Section A:  
 
## 1. Using the caret package in R, how do we
##    estimate the accuracy of model performance on
##    unseen data?
##    • This is typically done by estimating
##      accuracy using data that was not used to
##      train the model. The accuracy of the model
##      predictions on data seen during training
##      cannot be used as an estimate for the accuracy
##      of the model on unseen data.

## 2. TRUE or FALSE:  The accuracy of the model 
##    predictions on data seen during training 
##    cannot be used as an estimate for the accuracy
##    of the model on unseen data in general.
##    • True

## 3. Explain resampling methods
##    • They take multiple samples or make multiple
##      splits of your dataset into portions that
##      you can use for model training and model
##      testing.

## 4. List two resampling methods
##    • Data splitting
##    • k-fold cross-validation method

## 5. State a popular dataset split (Hint: use the
##    Iris dataset/Naïve Bayes model example)
##    • 80/20 or 70/30 are popular data splits for
##      training and testing data
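
The 80/20 split described above can be sketched in a few lines. This is only an illustration in plain Python, not the caret code from the assignment: the function name `train_test_split_indices` and the seed are assumptions, and the split here is a simple random one (caret's `createDataPartition` additionally balances the classes across the two parts).

```python
import random

def train_test_split_indices(n_rows, train_fraction=0.80, seed=7):
    """Return (train_index, test_index) for a simple random holdout split."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(round(n_rows * train_fraction))
    train_index = sorted(indices[:n_train])
    test_index = sorted(indices[n_train:])   # every row NOT in train_index
    return train_index, test_index

# The iris dataset has 150 rows, so an 80/20 split gives 120 and 30 rows.
train_index, test_index = train_test_split_indices(150)
print(len(train_index), len(test_index))
```

The model is fitted on the rows in `train_index` only, and its accuracy is then measured on the rows in `test_index`, which it has never seen.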

## 6. In the estimating model accuracy-resampling
##    methods algorithm, what does “iris$Species”
##    mean?
##    • It accesses the Species variable (column)
##      in the iris dataset.

## 7. In the estimating model accuracy-resampling
##    methods algorithm, what does “predictions
##    class” mean?
##    • It accesses the class component of the
##      predictions object, that is, the predicted
##      class labels that are passed to the
##      confusion matrix.

## 8. In the estimating model accuracy-resampling
##    methods algorithm, what does
##    “iris[-trainIndex, ]” mean?
##    • It selects the rows of the iris dataset that
##      are not in trainIndex, that is, the held-out
##      test data set.

## 9. Explain k-fold cross-validation.
##    • It involves splitting the dataset into
##      k subsets. Each subset is held out while
##      the model is trained on all other subsets.
##      This process is repeated until accuracy is
##      determined for each instance in the dataset,
##      and an overall accuracy estimate is
##      provided.
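
The hold-one-fold-out loop described above can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not caret's implementation: the helper names, the seed, the toy label list, and the "majority class" stand-in model are all hypothetical.

```python
import random
from collections import Counter

def k_fold_indices(n_rows, k, seed=7):
    """Shuffle row indices and deal them into k roughly equal folds."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_baseline_accuracy(labels, train_rows, test_rows):
    """Hypothetical stand-in model: always predict the most common training label."""
    majority = Counter(labels[i] for i in train_rows).most_common(1)[0][0]
    hits = sum(1 for i in test_rows if labels[i] == majority)
    return hits / len(test_rows)

def cross_validate(labels, k):
    """Hold out each fold once, train on the rest, and average the fold accuracies."""
    folds = k_fold_indices(len(labels), k)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_rows = [i for i in range(len(labels)) if i not in held]
        scores.append(majority_baseline_accuracy(labels, train_rows, held_out))
    return sum(scores) / len(scores)

# Toy data (an assumption): 150 rows, 90 of class "a" and 60 of class "b".
labels = ["a"] * 90 + ["b"] * 60
cv_accuracy = cross_validate(labels, k=10)
print(round(cv_accuracy, 3))
```

Every row is used for testing exactly once and for training k - 1 times, which is why the averaged score is a fairer estimate than accuracy measured on the training data.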

## 10. What are other names for “instances?”
##     • Observations, rows, records, samples, or
##       examples

## 11. In the k-fold cross-validation algorithm, 
##     what does “cv” stand for?
##     •    Cross validation

## 12. In the k-fold cross-validation algorithm, 
##     what does “Species~., “  mean?
##     • The Species column is predicted from the
##       data in all the other columns.

## 13. Fill-in-the-blank:  We can estimate the 
##     accuracy of our model on test data using 
##     _________ ________ ___________ .
##     •    on unseen data

## 14. Explain the difference between the Accuracy 
##     metric and the RMSE metric.
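##    • Accuracy is used for classification
##      problems: it is the share of instances
##      classified correctly. RMSE (root mean
##      squared error) is used for regression
##      problems: it is the average deviation of
##      the predictions from the observations.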
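
To make the contrast concrete, the two metrics can be computed by hand. This is a small sketch in plain Python; the function names and the toy values are assumptions chosen for illustration, not output from the assignment.

```python
import math

def accuracy(actual, predicted):
    """Share of class labels predicted correctly (classification)."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def rmse(observed, predicted):
    """Root mean squared error between numeric values (regression)."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Classification: 3 of 4 labels match.
acc = accuracy(["a", "b", "a", "c"], ["a", "b", "c", "c"])

# Regression: errors are 1, 0, 1, 2, so the mean squared error is 6 / 4 = 1.5.
err = rmse([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 11.0])

print(acc, round(err, 4))
```

Accuracy is bounded between 0 and 1 and higher is better, while RMSE is in the units of the outcome variable and lower is better.
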
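
## 15. What are the caret package metrics for
##     evaluating your model?
##    • Accuracy and Kappa – for classification
##      models
##    • RMSE and R² – for numeric (regression)
##      models
##    • Area under the ROC curve and logarithmic
##      loss are also available for classification.
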
## 16. Fill-in-the-blank:    ___________ and 
##    ___________ are the default metrics used to 
##    evaluate algorithms on binary and multiclass
##    classification datasets in caret.  They are
##    also associated with error rates.
##    • Accuracy and Kappa
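
Kappa can be read as accuracy corrected for the agreement expected by chance. The sketch below computes Cohen's kappa in plain Python; the function names and the ten toy labels are assumptions for illustration and are unrelated to the values reported in Section B.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Observed agreement: share of labels predicted correctly."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def cohen_kappa(actual, predicted):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(actual)
    observed_agreement = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    labels = set(actual) | set(predicted)
    # Agreement expected if predictions were made at random with the same class rates.
    expected_agreement = sum(
        (actual_counts[label] / n) * (predicted_counts[label] / n)
        for label in labels
    )
    return (observed_agreement - expected_agreement) / (1 - expected_agreement)

actual = ["pos"] * 5 + ["neg"] * 5
predicted = ["pos"] * 4 + ["neg"] + ["neg"] * 3 + ["pos"] * 2

acc = accuracy(actual, predicted)        # 7 of 10 correct
kappa = cohen_kappa(actual, predicted)   # chance agreement here is 0.5
print(round(acc, 2), round(kappa, 2))
```

Here the accuracy is 0.7 but kappa is only 0.4, because half of the agreement would be expected by chance alone.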

## 17. Explain the accuracy evaluation metric.
##     • It is the percentage of correctly
##       classified instances out of all instances.

## 18. What do the arguments “method="cv",
##     number=5” mean?
##     • They request 5-fold cross-validation to
##       estimate model performance on the longley
##       dataset.

## 19. What does “diabetes~., “  mean?
##     • The diabetes column is predicted from the
##       data in all the other columns.

## 20. Explain the coefficient of determination.
##     • It provides a goodness-of-fit measure for
##       the predictions to the observations. This
##       is a value between 0 and 1 for no fit and
##       perfect fit, respectively.
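
A worked example may help here. The sketch below uses one common definition, R² = 1 - SS_res / SS_tot, in plain Python; the function name and the toy values are assumptions, and the Rsquared value that caret reports may be computed differently (for example as a squared correlation), so the two need not match exactly.

```python
def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)        # total variation
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted)) # unexplained variation
    return 1 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

# SS_tot is 10 and SS_res is 0.10, so R² is about 0.99 (a near-perfect fit).
r2 = r_squared(observed, predicted)
print(round(r2, 2))
```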


## 21. What is a goodness-of-fit measure?  Explain.
##     • It describes how well a model fits a set of
##       observations. Measures of goodness-of-fit
##       typically summarize the discrepancy
##       between observed values and the values
##       expected under the model in question.

## 22.  Fill-in-the-blank:  A __________  _________
##      is an ensemble of unpruned decision trees.
##      •   Random forest

## 23.  TRUE or FALSE:  The error rate, associated
##       with a random forest model, is not robust
##       to noise in the training dataset.
##       •  False

## 24.   Explain bagging.
##       •  Each decision tree is built from a
##          random subset of the training dataset,
##          sampled with replacement. Building many
##          trees on such bootstrap samples and
##          combining their predictions is known as
##          bagging (bootstrap aggregating).

## 25. TRUE or FALSE:  The random forest model
##     builder, in randomly selecting the dataset,
##     introduces randomness.  Because of this
##     randomness, the model delivers robustness to
##     changes in the test data.
##     •    True

## 26. List two levels of randomness.
##     •    observations and variables
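
The two levels of randomness can be illustrated with a short sketch in plain Python. It is not the randomForest implementation: the helper names and the seed are assumptions, the row count and variable names are those of the iris dataset, and the choice of 2 variables per split mirrors the value reported in Section B.

```python
import random

def bootstrap_rows(n_rows, rng):
    """Level 1 (observations): draw n_rows row indices WITH replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_variables(variable_names, mtry, rng):
    """Level 2 (variables): pick mtry distinct variables to try at one split."""
    return rng.sample(variable_names, mtry)

rng = random.Random(7)  # fixed seed so the example is repeatable

# Rows used to grow one tree; some rows repeat and some are never drawn.
rows = bootstrap_rows(150, rng)
out_of_bag = set(range(150)) - set(rows)

variables = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]
chosen = candidate_variables(variables, 2, rng)

print(len(rows), len(set(rows)), len(out_of_bag), chosen)
```

Because each tree sees a different sample of rows and considers a different subset of variables at each split, the trees disagree in useful ways, and combining their votes reduces the influence of any single noisy observation or variable.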

## Section B:
 
## Directions:  Run the “A Standalone Model.”  Then
## answer the following questions.
 
 
## 1.  What is the estimated accuracy of the optimal
##     configuration?

## 2.  What is the accuracy of the final standalone
##     model trained on all of the training dataset
##     and predicted for the validation dataset?
##     •    Accuracy : 0.7805

## 3.  What is the number of variables tried at each
##     split?
##     •    2

## 4.  What is the number of trees requested by the
##     “ntree=” argument?
##     •    2000

## 5.  Which package is using the randomForest
##     package and, in turn, using the
##     randomForest() function?
##     •    caret (mlbench only supplies benchmark
##          datasets)

## 6.  What is the Kappa value?
##     •    0.7036
## 7.  What is the sensitivity value?
##     •    0.9091

## 8.  What is the p-value?
##     •    0.6831
## 9.  What are the sensitivity and specificity 
##     values?
##     •    Sensitivity : 0.9091          
##     •    Specificity : 0.7895
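
For reference, both values come straight from the confusion matrix counts. The sketch below shows the arithmetic in plain Python; the function names are assumptions and the counts are hypothetical, chosen only to illustrate the formulas, not taken from the model output above.

```python
def sensitivity(true_positives, false_negatives):
    """True positive rate: share of actual positives that were detected."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """True negative rate: share of actual negatives that were correctly rejected."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical confusion matrix counts (not from the assignment output).
tp, fn, tn, fp = 40, 10, 45, 5

sens = sensitivity(tp, fn)   # 40 / 50
spec = specificity(tn, fp)   # 45 / 50
print(sens, spec)
```
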

## 10. Explain “C~., “
##     • The C column is predicted from the data in
##       all the other columns.

## 11. What is a goodness-of-fit measure?
##     • It describes how well a model fits a set of
##       observations. It summarizes the discrepancy
##       between observed values and the values
##       expected under the model.

## 12. Name two default metrics of the caret package
##     used for evaluating your model.
##     • Accuracy for classification problems and
##       RMSE for regression problems

Paul Brown

1/16/2023