Machine Learning 1 – Assignment 1A Name: __________________
Spring 2023 Date: ________________
Directions: Complete the following exercises.
1. When preprocessing your data in either R or Python, what are some of the transforms used to clean the data? (List at least three.)
• Normalizing data, scaling data, handling missing values
2. What is normalization? Explain.
• Normalization refers to rescaling an input variable to the range between 0 and 1. It requires that you know, or can accurately estimate, the minimum and maximum values for each attribute.
3. What is the function dataset_minmax() used for?
• It calculates the min and max value for each attribute in the dataset, then returns these minimum and maximum values as a list of [min, max] pairs, one pair per attribute.
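A minimal sketch of how such a function might look, assuming the dataset is a list of rows holding numeric values. The name dataset_minmax and the small sample dataset are assumptions made for illustration, not the worksheet's own listing.

```python
def dataset_minmax(dataset):
    """Return a [min, max] pair for every column of a list-of-lists dataset."""
    minmax = []
    for i in range(len(dataset[0])):
        # Collect the i-th value of every row to form one column.
        column = [row[i] for row in dataset]
        minmax.append([min(column), max(column)])
    return minmax

# Made-up dataset: two rows, two columns.
dataset = [[50, 30], [20, 90]]
minmax = dataset_minmax(dataset)
print(minmax)  # [[20, 50], [30, 90]]
```

The result is a list of lists: one inner [min, max] list per column.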
4. Fill-in-the-blank: The __________ and __________ values can be estimated from the training data or specified directly.
• Minimum and maximum
5. Fill-in-the-blank: The ___________ format in Listing 2.3 (P.10) is the output of calculating the min and max values.
• List of lists
6. TRUE or FALSE: Once we have estimates of the max and min allowed values for each column, we can now normalize the raw data to the range 0 to 1.
• True
7. What is the scaled value used for?
• It is the result of normalizing a single value, calculated as scaled_value = (value - min) / (max - min).
8. Which function is implemented to normalize values in each column of a provided dataset?
• normalize_dataset()
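The column-by-column normalization can be sketched as follows. This is an illustration only: it assumes the per-column [min, max] pairs are already available, uses a made-up dataset, and assumes no column is constant (a constant column would cause a division by zero).

```python
def normalize_dataset(dataset, minmax):
    """Rescale every value in place to the range 0 to 1, column by column."""
    for row in dataset:
        for i in range(len(row)):
            col_min, col_max = minmax[i]
            # scaled_value = (value - min) / (max - min)
            row[i] = (row[i] - col_min) / (col_max - col_min)

# Made-up dataset and its per-column [min, max] pairs, specified directly.
dataset = [[50, 30], [20, 90]]
minmax = [[20, 50], [30, 90]]
normalize_dataset(dataset, minmax)
print(dataset)  # [[1.0, 0.0], [0.0, 1.0]]
```

Each column's smallest value maps to 0.0 and its largest to 1.0.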
9. Fill-in-the-blank: The ________ and __________ for each column are estimated from the dataset ...
• Minimum and maximum
10. What are the dimensions of our selected Pima Indians Diabetes dataset?
• 768 rows by 9 columns (768 × 9)
11. Fill-in-the-blank: The ______ ______ ______ ________ ______ is printed before and after normalization, showing the effect of scaling.
• first record from the dataset
12. Define and state standardization.
• Standardization is a rescaling technique that centers the distribution of the data on the value 0 and rescales it so that the standard deviation is 1.
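As a rough illustration, standardizing one column means subtracting the column mean from each value and dividing by the column standard deviation. The sketch below uses Python's statistics module and the sample standard deviation (dividing by n - 1); the helper name standardize_column and the sample values are assumptions for this example.

```python
from statistics import mean, stdev

def standardize_column(values):
    """Return the values rescaled as (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - mu) / sigma for v in values]

# Made-up column: mean is 4 and sample standard deviation is 2.
column = [2, 4, 6]
standardized = standardize_column(column)
print(standardized)  # [-1.0, 0.0, 1.0]
```

After standardization the column has a mean of 0 and a standard deviation of 1.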
13. What is another name for the normal distribution?
• Gaussian distribution
14. We can estimate the mean and standard deviation from which part of the dataset?
• Each column
15. The mean for a column is calculated as ___________________________ divided by the total number of values.
• the sum of all values for a column
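The calculation described above might be sketched like this, assuming a list-of-lists dataset with numeric values. The function body and the sample data are illustrative assumptions.

```python
def column_means(dataset):
    """Return the mean of each column: sum of the column / number of rows."""
    means = []
    for i in range(len(dataset[0])):
        column = [row[i] for row in dataset]
        means.append(sum(column) / len(dataset))
    return means

# Made-up dataset: three rows, two columns.
dataset = [[50, 30], [20, 90], [20, 60]]
means = column_means(dataset)
print(means)  # [30.0, 60.0]
```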
16. Fill-in-the-blank: For a loaded dataset, we can refer to this dataset as a _______ _________ ________, and the second list being the list of column values for a given row.
• List of lists: the first list is the list of observations or rows
• The second list is the list of column values for a given row
17. TRUE or FALSE: If your data is not a Gaussian distribution, consider normalizing it only after applying your machine learning algorithm.
• False
18. List at least five functions listed in this section and explain what they are used for.
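• standardize_dataset() function - standardizes the values in each column of the dataset using the estimated mean and standard deviation summary statistics.
• reader() function - takes an open file as an argument and returns an object that iterates over the rows of the file.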
• load_csv() function - takes a file name and returns a dataset.
• dataset_minmax() function - calculates the min and max value for each attribute in the dataset, then returns these minimum and maximum values as a list.
• column_means() function - calculates the mean value for each column in the dataset.
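To show how the loading functions fit together, here is a hedged sketch of a load_csv() built on csv.reader() from Python's standard library. The function body, the temporary file, and the sample values are assumptions made so the example is self-contained; they are not taken from the worksheet's listings.

```python
import csv
import os
import tempfile

def load_csv(filename):
    """Read a CSV file into a list of rows, skipping empty lines."""
    dataset = []
    with open(filename, newline="") as file:
        for row in csv.reader(file):
            if row:
                dataset.append(row)
    return dataset

# Write a tiny made-up CSV to a temporary file for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("6,148,72\n1,85,66\n")
    path = tmp.name

dataset = load_csv(path)
os.remove(path)

# csv.reader yields strings, so convert to floats before any scaling.
dataset = [[float(value) for value in row] for row in dataset]
print(dataset)  # [[6.0, 148.0, 72.0], [1.0, 85.0, 66.0]]
```

The loaded dataset is a list of lists, so it can be passed to functions such as dataset_minmax() or column_means() once the values are numeric.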