Data Engineering and Mining II
Name: ____Paul Brown______
Fall 2022 - Spring 2023
Assignment 8 - Model Performance Evaluation - Part 1
Directions: Complete the following exercises. Fill-in-the-blank:
1. _________ ____________ is the art and science of intelligent data analysis.
• Data Mining
2. What is the aim of data science?
• The main objective of data science is to discover patterns in data. It makes sense of the data through a variety of statistical techniques. After data extraction, wrangling, and pre-processing, a data scientist must carefully examine the data.
3. Fill-in-the-blank:
We often describe _________ ___________ as the process of building models.
• Machine learning
4. What can we do to evaluate a model?
• Identify a dataset on which to perform the evaluation.
5. What is the problem with evaluating our model on the training dataset?
• Performance on the training dataset does not give us a very good idea of how well the model will perform in general on previously unseen data, because the model was built to fit that same data.
6. Why do we use validation and/or testing datasets? (We can obtain an error rate)
• We use the validation dataset during the modeling process to tune and compare candidate models and to identify the “best” one as the final model. The last step is to assess that model’s performance on the testing dataset.
7. What is the validation dataset used for?
• The validation dataset is used during the modeling process to test different parameter settings or choices of variables on the way to the final model.
8. What is a dataset?
• a collection of data
9. Fill-in-the-blank:
The ________ of a dataset refers to the number of observations (rows) and the number of variables (columns).
• dimension
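As a small illustration of the answer above, the following sketch computes the dimension of a tiny dataset held as a list of rows. The variable names and values are made up for the example, and `dimension` is a hypothetical helper, not part of any library.

```python
# Made-up observations: each inner list is one row (observation),
# and each position in a row is one variable (column).
variables = ["age", "weight_kg", "gender"]
rows = [
    [34, 70.5, "F"],
    [51, 82.0, "M"],
    [27, 64.3, "F"],
    [45, 90.1, "M"],
]

def dimension(rows):
    """Return (number of observations, number of variables)."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0  # an empty dataset has no columns
    return n_rows, n_cols

print(dimension(rows))
```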
10. What variables are also known as target, response or dependent variables?
• Output variables
11. How do we represent values such as the names or the qualities of objects in data mining?
• Character strings
12. How do we represent values such as quantitative values in data mining?
• numerical
13. Categorical variables are always discrete. Give three examples of categorical variables.
• Gender, race, color
14. Categoric variables may also be known as factors. TRUE or FALSE
• True
15. A numeric variable has values that are real numbers, such as a person’s age or weight. TRUE or FALSE
• True
16. For building a predictive model, we often partition a dataset into three independent datasets. What are those three types of datasets?
• a training dataset, a validation dataset, and a testing dataset.
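One way to sketch the three-way partition in plain Python is shown below. The 70/15/15 proportions, the seed, and the `partition` helper are assumptions made for the example; the assignment does not prescribe them.

```python
import random

def partition(rows, train_frac=0.70, validate_frac=0.15, seed=42):
    """Randomly split rows into training, validation, and testing lists."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_validate = round(len(shuffled) * validate_frac)
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]  # the remainder is held out
    return train, validate, test

train, validate, test = partition(range(100))
print(len(train), len(validate), len(test))
```

Because every observation lands in exactly one of the three lists, the testing rows are never seen while the model is built or tuned.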
17. Fill-in-the-blank:
When building a model, we build our model using the training dataset. The ___________ __________ is used to assess the model’s performance.
• Validation dataset
18. Fill-in-the-blank:
Once we are satisfied with the model, we assess its expected performance into the future using the ___________ ____________ .
• Testing dataset
19. Which dataset must be a holdout or out-of-sample dataset?
• Testing dataset
21. Which dataset consists of randomly selected observations from the full dataset that are not used in the building of the model?
• Testing data
22. When evaluating model performance, what is the name of the table that compares predictions with actual answers?
• Confusion matrix
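A confusion matrix can be built by counting (actual, predicted) pairs. The sketch below uses made-up labels and a hypothetical `confusion_matrix` helper; it is one simple way to do the tally, not the method of any particular tool.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))

# Made-up actual classes and model predictions for six observations.
actual    = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "pos", "neg"]
matrix = confusion_matrix(actual, predicted)

# Print one row per actual class and one column per predicted class.
labels = ["pos", "neg"]
print("actual/predicted", *labels)
for a in labels:
    print(a, *(matrix[(a, p)] for p in labels))
```

The diagonal cells (actual equals predicted) hold the correct predictions; the off-diagonal cells hold the errors.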
23. In applying a model to a new dataset, the new dataset must contain all of the same variables and have different data types. TRUE or FALSE
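• False. The new dataset must contain all of the same variables, and they must have the same data types as in the data used to build the model.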
24. When working with some measures of performance of a model, we sometimes use a binary classification model. When working with binary classification, we often identify the predictions as ___________ and ____________ classes.
• positive and negative
25. List two machine learning algorithm categories.
• Predictive and descriptive
26. Fill-in-the-blanks:
A ____________ _____________ is used for tasks that involve the prediction of one value using other values in the dataset.
• predictive model
26. The machine learning algorithm attempts to discover and model the relationship between the _________ _________ and the other __________ .
• Target feature and features
27. Fill-in-the-blank:
Because predictive models are given clear instruction on what they need to learn and how they are intended to learn it, the process of training a predictive model is known as _________ _________ .
• Supervised learning
28. Fill-in-the-blank:
The often used supervised machine learning task of predicting which category an example belongs to is known as ________________ .
• classification
29. Complete the following statement:
A random forest algorithm is used for tasks that involve _________...
• Classification and regression. It combines the output of multiple decision trees to reach a single result.
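The "combining" step for classification can be sketched as a majority vote. The three functions below are hypothetical stand-ins for trained decision trees, with made-up thresholds; a real random forest would learn each tree from a random sample of the training data.

```python
from collections import Counter

# Hypothetical stand-ins for trained trees: each maps an observation
# (a dict of feature values) to a class label.
def tree_a(obs):
    return "yes" if obs["humidity"] > 70 else "no"

def tree_b(obs):
    return "yes" if obs["pressure"] < 1010 else "no"

def tree_c(obs):
    return "yes" if obs["cloud"] >= 6 else "no"

def forest_predict(trees, obs):
    """Return the class label that receives the most votes."""
    votes = Counter(tree(obs) for tree in trees)
    return votes.most_common(1)[0][0]

observation = {"humidity": 80, "pressure": 1015, "cloud": 7}
print(forest_predict([tree_a, tree_b, tree_c], observation))
```

For regression, the same idea would average the trees' numeric outputs instead of counting votes.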
30. Fill-in-the-blank:
The _________ _________ is calculated as the proportion of observations for which the model incorrectly predicts the class with respect to the actual class.
• Error rate
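The definition above translates directly into a short calculation. This is a minimal sketch with made-up labels; `error_rate` is a hypothetical helper name.

```python
def error_rate(actual, predicted):
    """Proportion of observations whose predicted class differs from the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Made-up example: two of the six predictions are wrong.
actual    = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "pos", "neg"]
print(error_rate(actual, predicted))
```

The accuracy of the model is simply one minus this value.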
31. What is a true positive?
• When the model predicts the positive class and the actual class is positive
32. What is a false negative?
• When the model predicts the negative class but the actual class is positive
33. What are the two types of classification errors?
• False negative and false positive
34. Fill-in-the-blank:
The precision of a model is a measure of how accurate the positive predictions are, or how precise the model is in ___________ .
• predicting
35. Fill-in-the-blank:
The recall of a model is just another name for the true positive rate. The recall is also known as the ___________ of the model.
• sensitivity
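Precision and recall can both be computed from the counts of true positives, false positives, and false negatives. The sketch below uses made-up labels and a hypothetical `precision_recall` helper; returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def precision_recall(actual, predicted, positive="pos"):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct share of positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual positives found
    return precision, recall

# Made-up example: 2 true positives, 1 false positive, 2 false negatives.
actual    = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
predicted = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg"]
precision, recall = precision_recall(actual, predicted)
print(precision, recall)
```

Here precision is 2 out of 3 positive predictions and recall is 2 out of 4 actual positives, which shows that the two measures can differ on the same predictions.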