Machine Learning 1 – Assignment 1A                  Name: __________________
Spring 2023                                          Date: ________________
Directions: Complete the following exercises.

1. When preprocessing your data in either R or Python, what are some of the transforms used to clean the data? (List at least three.)
•   Normalizing data, scaling data, handling missing values

2. What is normalization? Explain.
•   Normalization refers to rescaling an input variable to the range between 0 and 1. It requires that you know the minimum and maximum values for each attribute.

3. What is the function dataset_minmax() used for?
•   It calculates the min and max value for each attribute in the dataset, then returns an array of these minimum and maximum values.
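
The answer above can be sketched in Python roughly as follows. This is a minimal illustration, not the textbook listing: the spelling `dataset_minmax`, the list-of-lists input, and the tiny two-column dataset are all assumptions made for the example.

```python
def dataset_minmax(dataset):
    """Return a [min, max] pair for every column of a list-of-lists dataset."""
    minmax = []
    for i in range(len(dataset[0])):
        # Collect all values of column i across the rows
        col_values = [row[i] for row in dataset]
        minmax.append([min(col_values), max(col_values)])
    return minmax


# Made-up dataset with two rows and two columns (illustration only)
dataset = [[50, 30], [20, 90]]
minmax = dataset_minmax(dataset)
print(minmax)  # [[20, 50], [30, 90]]
```

The result is a list of lists: one inner [min, max] pair per column.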

4. Fill-in-the-blank: The __________ and __________ values can be estimated from the training data or specified directly.
•   Minimum and maximum

5. Fill-in-the-blank: The ___________ format in Listing 2.3 (P.10) is the output of calculating the min and max values.
•   List of lists

6. TRUE or FALSE: Once we have estimates of the max and min allowed values for each column, we can now normalize the raw data to the range 0 to 1.
•   True

7. What is the scaled value used for?
•   It is the result of normalizing a single value: scaled_value = (value - min) / (max - min).

8. Which function is implemented to normalize values in each column of a provided dataset?
•   normalize_dataset()
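
A minimal sketch of such a function is shown below, assuming the dataset is a list of lists of numbers and that the per-column [min, max] pairs have already been estimated. The name `normalize_dataset`, its two-argument signature, and the sample values are assumptions for illustration.

```python
def normalize_dataset(dataset, minmax):
    """Rescale every value in place to the range 0..1.

    minmax holds one [min, max] pair per column (assumed precomputed).
    """
    for row in dataset:
        for i in range(len(row)):
            col_min, col_max = minmax[i]
            # scaled_value = (value - min) / (max - min)
            row[i] = (row[i] - col_min) / (col_max - col_min)


# Made-up dataset and its per-column minimum and maximum values
dataset = [[50, 30], [20, 90]]
minmax = [[20, 50], [30, 90]]
normalize_dataset(dataset, minmax)
print(dataset)  # [[1.0, 0.0], [0.0, 1.0]]
```

Note that this sketch does not guard against a column whose minimum equals its maximum, which would cause a division by zero.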

9. Fill-in-the-blank: The ________ and __________ for each column are estimated from the dataset ...
•   Minimum and maximum

10. What are the dimensions of our selected Pima Indians Diabetes dataset?
•   768 × 9 (rows × columns)

11. Fill-in-the-blank: The ______ ______ ______ ________ ______ is printed before and after normalization, showing the effect of scaling.
•   first record from the dataset

12. Define and state standardization.
•   Standardization is a rescaling technique that centers the distribution of the data on the value 0 and rescales its standard deviation to 1.
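
As a rough illustration, each value can be standardized as (value - mean) / standard deviation for its column. The sketch below assumes the per-column means and standard deviations are already known; the function signature and the sample numbers are hypothetical.

```python
def standardize_dataset(dataset, means, stdevs):
    """Standardize every value in place: (value - mean) / stdev per column."""
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - means[i]) / stdevs[i]


# Made-up dataset: column 0 has mean 2 and stdev 1, column 1 has mean 20 and stdev 10
dataset = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
standardize_dataset(dataset, means=[2.0, 20.0], stdevs=[1.0, 10.0])
print(dataset)  # [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
```

After standardization, each column in this example is centered on 0 with a standard deviation of 1.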

13. What is another name for the normal distribution?
•   Gaussian distribution

14. We can estimate the mean and standard deviation from which part of the dataset?
•   The training data, with one mean and one standard deviation estimated for each column

15. The mean for a column is calculated as ___________________________ divided by the total number of values.
•   the sum of all values for a column
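
The calculation can be sketched as below. `column_means` follows the name used later in this worksheet; `column_stdevs`, the use of the sample standard deviation (dividing by n - 1), and the sample data are assumptions made for this example.

```python
from math import sqrt


def column_means(dataset):
    """Mean of each column: sum of the column's values / number of rows."""
    n_cols = len(dataset[0])
    return [sum(row[i] for row in dataset) / len(dataset) for i in range(n_cols)]


def column_stdevs(dataset, means):
    """Sample standard deviation of each column (divides by n - 1)."""
    stdevs = []
    for i in range(len(dataset[0])):
        squared_diffs = sum((row[i] - means[i]) ** 2 for row in dataset)
        stdevs.append(sqrt(squared_diffs / (len(dataset) - 1)))
    return stdevs


# Made-up dataset with three rows and two columns
dataset = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means = column_means(dataset)
stdevs = column_stdevs(dataset, means)
print(means)   # [2.0, 20.0]
print(stdevs)  # [1.0, 10.0]
```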

16. Fill-in-the-blank: For a loaded dataset, we can refer to this dataset as a _______ _________ ________, and the second list being the list of column values for a given row.
•   list of observations or rows 
•   list of column values

17. TRUE or FALSE: If your data is not a Gaussian distribution, consider normalizing it only after applying your machine learning algorithm.

•   False

18. List at least five functions listed in this section and explain what they are used for.
•   standardize_dataset() function - standardizes each column of the dataset using the estimated mean and standard deviation summary statistics.
•   reader() function - takes an open file as an argument and reads it row by row.
•   load_csv() function - takes a file name and returns a dataset.
•   dataset_minmax() function - calculates the min and max value for each attribute in the dataset, then returns an array of these minimum and maximum values.
•   column_means() function - calculates the mean value for each column in the dataset.
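
To show how reader() and load_csv() fit together, here is a minimal sketch using Python's standard csv module. Converting every field to float inside load_csv() is an assumption made to keep the example short, and the temporary file with two made-up rows exists only for the demonstration.

```python
from csv import reader
import os
import tempfile


def load_csv(filename):
    """Load a CSV file into a list of rows, each a list of floats."""
    dataset = []
    with open(filename, "r", newline="") as file:
        for row in reader(file):
            if not row:
                continue  # skip empty lines
            # Assumption: every field is numeric
            dataset.append([float(value) for value in row])
    return dataset


# Write a small temporary CSV file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("6,148,72\n1,85,66\n")
    path = tmp.name

rows = load_csv(path)
os.remove(path)
print(rows)  # [[6.0, 148.0, 72.0], [1.0, 85.0, 66.0]]
```

The loaded result is a list of lists: the outer list holds the rows, and each inner list holds the column values for one row.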