# Assignment 7 – Random Forest Model – Part 2
## Section A:
## 1. Using the caret package in R, how do we
## estimate the accuracy of model performance on
## unseen data?
## • This is typically done by estimating
## accuracy using data that was not used to
## train the model. The accuracy of the model
## predictions on data seen during training
## cannot be used as a reliable estimate of the
## accuracy of the model on unseen data.
## 2. TRUE or FALSE: The accuracy of the model
## predictions on data seen during training
## cannot be used as an estimate for the accuracy
## of the model on unseen data in general.
## • True
## 3. Explain resampling methods
## • They take multiple samples from your
## dataset, or split it in multiple ways, into
## portions that you can use for model training
## and model testing.
## 4. List two resampling methods
## • Data splitting
## • k-fold cross-validation method
## 5. State a popular dataset split (Hint: use the
## Iris dataset/Naïve Bayes model example)
## • 80/20 and 70/30 are popular splits between
## training and testing data
## 6. In the estimating model accuracy-resampling
## methods algorithm, what does “iris$Species”
## mean?
## • Access the species variable in the iris
## dataset
## 7. In the estimating model accuracy-resampling
## methods algorithm, what does “predictions$class”
## mean?
## • It accesses the class element of the
## predictions object, i.e. the predicted class
## labels, which the confusion matrix compares
## with the true labels
## 8. In the estimating model accuracy-resampling
## methods algorithm, what does “iris[ -
## trainIndex, ]” mean?
## • It selects every row of the iris dataset
## that is not listed in trainIndex, which gives
## the test (held-out) data set
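## The split discussed in questions 5 to 8 can be
## sketched in base R. This is a minimal sketch,
## not the assignment's caret code: it assumes a
## plain random sample() instead of a stratified
## caret partition, and the seed and the names
## train_set and test_set are illustrative.

```r
# Base-R sketch of an 80/20 holdout split on the built-in iris data.
set.seed(42)                                     # assumed seed, for repeatability only
n <- nrow(iris)                                  # 150 rows
trainIndex <- sample(n, size = round(0.80 * n))  # row numbers used for training

train_set <- iris[trainIndex, ]    # rows listed in trainIndex
test_set  <- iris[-trainIndex, ]   # every row NOT listed in trainIndex
labels    <- iris$Species          # the Species column, a factor of class labels

c(train = nrow(train_set), test = nrow(test_set))
```

## The negative index drops the training rows, so
## the two data sets never share an instance.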
## 9. Explain k-fold cross-validation.
## • It involves splitting the dataset into
## k subsets. Each subset is held out in turn
## while the model is trained on all other
## subsets. This process is repeated until
## accuracy is determined for each instance in
## the dataset, and an overall accuracy estimate
## is provided.
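## The procedure can be sketched by hand in base R.
## This is an illustration under assumptions, not
## what caret runs internally: the nearest-centroid
## classifier, the seed, and k = 5 are all chosen
## only to keep the example small.

```r
# Manual 5-fold cross-validation on iris with a nearest-centroid classifier.
set.seed(7)                                          # assumed seed
k <- 5
folds <- sample(rep(1:k, length.out = nrow(iris)))   # fold label for every row
predicted <- character(nrow(iris))

for (i in 1:k) {
  test_rows <- which(folds == i)                     # the held-out subset
  train <- iris[-test_rows, ]                        # all other subsets
  # "Training": mean of each numeric column per class
  centroids <- aggregate(. ~ Species, data = train, FUN = mean)
  for (r in test_rows) {
    x <- unlist(iris[r, 1:4])
    d <- apply(centroids[, -1], 1, function(m) sum((m - x)^2))
    predicted[r] <- as.character(centroids$Species[which.min(d)])
  }
}

# Every instance was predicted exactly once, while it was held out
cv_accuracy <- mean(predicted == iris$Species)
cv_accuracy
```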
## 10. What are other names for “instances?”
## • Observations, rows, records, samples, or
## examples
## 11. In the k-fold cross-validation algorithm,
## what does “cv” stand for?
## • Cross validation
## 12. In the k-fold cross-validation algorithm,
## what does “Species~., “ mean?
## • The Species column is to be predicted from
## the data in all the other columns
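## How the dot in the formula expands can be
## checked directly in base R. A small sketch; the
## variable names f, predictors, and response are
## illustrative.

```r
# The formula "Species ~ ." means: Species explained by all other columns.
f <- Species ~ .

# terms() expands the dot against the columns of the supplied data frame
predictors <- attr(terms(f, data = iris), "term.labels")
response   <- all.vars(f)[1]

response     # the column being predicted
predictors   # the columns used as inputs
```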
## 13. Fill-in-the-blank: We can estimate the
## accuracy of our model on test data using
## _________ ________ ___________ .
## • on unseen data
## 14. Explain the difference between the Accuracy
## metric and the RMSE metric.
## • Accuracy is used for classification problems
## (share of correct predictions) and RMSE for
## regression problems (typical size of the
## prediction error)
## 15. What are the caret package metrics for
## evaluating your model?
## • Accuracy and Kappa – for classification
## models
## • RMSE and R² – for numeric (regression) models
## 16. Fill-in-the-blank: ___________ and
## ___________ are the default metrics used to
## evaluate algorithms on binary and multiclass
## classification datasets in caret. They are
## also associated with error rates.
## • Accuracy and Kappa
## 17. Explain the accuracy evaluation metric.
## • It is the percentage of correctly
## classified instances out of all instances:
## (TP + TN) / (TP + TN + FP + FN) in the
## binary case
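## Accuracy and Kappa can both be computed from a
## confusion matrix. The sketch below uses ten
## hypothetical labels invented for illustration
## (they are not output from the assignment), and
## the Kappa formula is the standard Cohen's kappa.

```r
# Hypothetical true and predicted labels for ten instances (illustration only)
actual    <- factor(c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c"))
predicted <- factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c"))

cm <- table(Predicted = predicted, Actual = actual)   # confusion matrix

# Accuracy: correctly classified instances (the diagonal) over all instances
accuracy <- sum(diag(cm)) / sum(cm)                   # 8 of 10 correct

# Kappa: accuracy corrected for the agreement expected by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
kappa_value <- (accuracy - expected) / (1 - expected)

c(accuracy = accuracy, kappa = kappa_value)
```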
## 18. What do the arguments “method="cv",
## number=5” mean?
## • They request 5-fold cross-validation to
## estimate model performance on the longley
## dataset
## 19. What does “diabetes~., “ mean?
## • The diabetes column is to be predicted from
## the data in all the other columns
## 20. Explain the coefficient of determination.
## • It provides a goodness-of-fit measure for
## the predictions to the observations. This
## is a value between 0 and 1 for no-fit and
## perfect fit respectively.
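## Both regression metrics can be computed by hand
## on the built-in longley data. This is a sketch
## under assumptions: it fits a plain lm() and
## scores it on the training rows, so the numbers
## are optimistic and are not the cross-validated
## estimates caret would report.

```r
# RMSE and R-squared for a linear model on longley (in-sample, illustration only)
fit       <- lm(Employed ~ ., data = longley)
observed  <- longley$Employed
predicted <- fitted(fit)

# RMSE: typical size of the prediction error, in the units of Employed
rmse <- sqrt(mean((observed - predicted)^2))

# R-squared: 1 minus (unexplained variation / total variation)
r_squared <- 1 - sum((observed - predicted)^2) /
                 sum((observed - mean(observed))^2)

c(RMSE = rmse, Rsquared = r_squared)
```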
## 21. What is a goodness-of-fit measure? Explain.
## • It describes how well a model fits a set of
## observations. Measures of goodness-of-fit
## typically summarize the discrepancy
## between observed values and the values
## expected under the model in question
## 22. Fill-in-the-blank: A __________ _________
## is an ensemble of unpruned decision trees.
## • Random forest
## 23. TRUE or FALSE: The error rate, associated
## with a random forest model, is not robust
## to noise in the training dataset.
## • False
## 24. Explain bagging.
## • Each decision tree is built from a random
## sample of the training dataset drawn with
## replacement, so an observation can appear
## more than once in a sample. The trees'
## predictions are then combined. This is
## known as bagging (bootstrap aggregating).
## 25. TRUE or FALSE: The random forest model
## builder, in randomly selecting the dataset,
## introduces randomness. Because of this
## randomness, the model delivers robustness to
## changes in the test data.
## • True
## 26. List two levels of randomness.
## • observations and variables
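## The two levels of randomness, together with the
## sampling step from question 24, can be sketched
## in base R. This only illustrates the sampling,
## not how the randomForest package is implemented;
## the seed and mtry = 2 are assumed values.

```r
# Two levels of randomness behind one tree of a random forest (sampling only)
set.seed(1)                                     # assumed seed
n <- nrow(iris)
predictors <- setdiff(names(iris), "Species")

# Level 1 (observations): a bootstrap sample, drawn WITH replacement
boot_rows  <- sample(n, size = n, replace = TRUE)
out_of_bag <- setdiff(seq_len(n), boot_rows)    # rows never drawn for this tree

# Level 2 (variables): a random subset of predictors considered at one split
mtry <- 2                                       # assumed value for illustration
split_candidates <- sample(predictors, size = mtry)

length(unique(boot_rows)) / n   # share of distinct rows in the bootstrap sample
split_candidates
```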
## Section B:
## Directions: Run the “A Standalone Model.” Then
## answer the following questions.
## 1. What is the estimated accuracy of the optimal
## configuration?
## 2. What is the accuracy of the final standalone
## model trained on all of the training dataset
## and predicted for the validation dataset?
## • Accuracy : 0.7805
## 3. What is the number of variables tried at each
## split?
## • 2
## 4. What is the number of trees requested by the
## “ntree=” argument?
## • 2000
## 5. Which package is using the randomForest
## package and, in turn, using the randomForest()
## function?
## • caret. The mlbench package is loaded only
## for its datasets.
## 6. What is the Kappa value?
## • 0.7036
## 7. What is the sensitivity value?
## • 0.9091
## 8. What is the p-value?
## • 0.6831
## 9. What are the sensitivity and specificity
## values?
## • Sensitivity : 0.9091
## • Specificity : 0.7895
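## Sensitivity and specificity are ratios taken from
## a binary confusion matrix. The counts below are
## hypothetical, chosen only so that the ratios
## round to the values reported above; the model's
## actual confusion matrix is not shown in this
## document.

```r
# Hypothetical counts (NOT the model's real confusion matrix)
tp <- 20   # positives predicted as positive
fn <- 2    # positives predicted as negative
tn <- 15   # negatives predicted as negative
fp <- 4    # negatives predicted as positive

sensitivity <- tp / (tp + fn)   # share of actual positives that were found
specificity <- tn / (tn + fp)   # share of actual negatives that were found

round(c(sensitivity = sensitivity, specificity = specificity), 4)
```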
## 10. Explain “C~., “
## • The C column is to be predicted from the
## data in all the other columns
## 11. What is a goodness-of-fit measure?
## • It describes how well a model fits a set of
## observations. It summarizes the discrepancy
## between observed values and the values
## expected under the model.
## 12. Name two default metrics of the caret package
## used for evaluating your model.
## • Accuracy for classification problems and
## RMSE for regression problems