Assignment 1A (Chapter 2) 
Machine Learning 1 – Assignment 1A (Spring 2026)
•   Available Mar 16 at 12am - Mar 21 at 11:59pm

Directions: Complete the following exercises.

1. When preprocessing your data in either R or Python, what are some of the transforms used to clean the data? (List at least three.)
-->
Standardization, normalization, exponential transforms, power transforms,
log transformation, square root transformation, and Box-Cox transformation.

2. What is normalization? Explain.
--->
Normalization is a preprocessing technique used to rescale numerical data to a common range.
The term can refer to different techniques depending on context; here it
refers to rescaling an input variable to the range between 0 and 1, also called min-max normalization (rescaling).

\[ \text{scaled}(x_i) = \frac{x_i - x_{min}}{x_{max} - x_{min}} \]
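The formula above can be tried on a single value. This is a minimal sketch, not code from the assignment: the helper name `scale_value` and the column range of 20 to 50 are assumptions chosen for illustration.

```python
def scale_value(x, x_min, x_max):
    """Rescale x to the range [0, 1] given the column minimum and maximum."""
    return (x - x_min) / (x_max - x_min)

# Hypothetical column whose smallest value is 20 and largest value is 50.
scaled_low = scale_value(20, 20, 50)   # the minimum maps to 0.0
scaled_mid = scale_value(35, 20, 50)   # the midpoint maps to 0.5
scaled_high = scale_value(50, 20, 50)  # the maximum maps to 1.0
print(scaled_low, scaled_mid, scaled_high)
```

Note that the denominator is zero when every value in a column is identical, so a real implementation would need to handle that case.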

3. What is the function dataset_minmax() used for?
--->
The dataset_minmax() function calculates the minimum and maximum value of each attribute (column) in the dataset, then returns a list of these [min, max] pairs, one per column.

4. Fill-in-the-blank: 
The __________ and __________ values can be estimated from the training data or specified directly.
--->
The minimum and maximum values for each column can be estimated from training data or specified directly if you have deep knowledge of the problem domain.


5. Fill-in-the-blank: 
The ___________ format in Listing 2.3 (P.10) is the output of calculating the min and max values.
--->
The dataset is printed in a list-of-lists format, then the min and max for each column are printed in the format column1: min,max and column2: min,max.

6. TRUE or FALSE: Once we have estimates of the max and min allowed values for each column, we can now normalize the raw data to the range 0 to 1.
--->
TRUE

7. What is scaled value used for?
--->
Scaled values are useful when the covariates are measured on different scales: putting them on a common scale can improve model performance.

8. Which function is implemented to normalize values in each column of a provided dataset?
--->

The function applies the min-max formula to every value in each column:

\[ \text{scaled}(x_i) = \frac{x_i - x_{min}}{x_{max} - x_{min}} \]
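Applied to a whole dataset, the same formula might look like the sketch below. The function name `normalize_dataset` and its signature are assumptions made for illustration, and the [min, max] pairs are written out by hand for a two-row sample rather than estimated from a file.

```python
def normalize_dataset(dataset, minmax):
    """Rescale every value in place to [0, 1] using per-column [min, max] pairs."""
    for row in dataset:
        for i in range(len(row)):
            col_min, col_max = minmax[i]
            row[i] = (row[i] - col_min) / (col_max - col_min)

# Small made-up dataset: two rows, two columns.
dataset = [[50, 30], [20, 90]]
# Per-column [min, max] pairs, written out by hand for this sample.
minmax = [[20, 50], [30, 90]]

normalize_dataset(dataset, minmax)
print(dataset)  # [[1.0, 0.0], [0.0, 1.0]]
```

The sketch modifies the dataset in place, so the raw values are lost unless a copy is kept beforehand.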

9. Fill-in-the-blank: 
The ________ and __________ for each column are estimated from the
dataset ...
--->
Minimum and maximum
Training dataset
 
10. What are the dimensions of our selected Pima Indians Diabetes dataset?
--->

11. Fill-in-the-blank: 
The ______ ______ ______ ________ ______ is printed before and after
normalization, showing the effect of scaling.
--->
The first record from the dataset is printed before and after normalization, showing the effect of scaling.

12. Define and state standardization.
--->
Standardization is a rescaling technique that centers the distribution of the data on the value 0 and scales the standard deviation to the value 1.
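A minimal from-scratch sketch of this definition follows. The function names (`column_means`, `column_stdevs`, `standardize_dataset`), the three-row sample data, and the choice of the sample standard deviation (n - 1 in the denominator) are all assumptions for illustration, not details given in the assignment.

```python
from math import sqrt

def column_means(dataset):
    """Mean of each column: the column sum divided by the number of rows."""
    return [sum(col) / len(col) for col in zip(*dataset)]

def column_stdevs(dataset, means):
    """Sample standard deviation (n - 1 denominator) of each column."""
    stdevs = []
    for col, mean in zip(zip(*dataset), means):
        variance = sum((x - mean) ** 2 for x in col) / (len(col) - 1)
        stdevs.append(sqrt(variance))
    return stdevs

def standardize_dataset(dataset, means, stdevs):
    """Replace each value in place with (value - mean) / stdev."""
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - means[i]) / stdevs[i]

# Small made-up dataset: three rows, two columns.
dataset = [[50, 30], [20, 90], [30, 50]]
means = column_means(dataset)
stdevs = column_stdevs(dataset, means)
standardize_dataset(dataset, means, stdevs)
print(dataset)
```

After the call, each column should have a mean of approximately 0 and a sample standard deviation of approximately 1, up to floating-point rounding.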

13. What is another name for the normal distribution?
--->
Gaussian distribution

14. We can estimate the mean and standard deviation from which part of the dataset?
--->
From the training data, or they can be specified using expert knowledge.


15. The mean for a column is calculated as ___________________________ divided by the total number of values.
--->
The mean for a column is calculated as the sum of all values for a column divided by the total number of values.
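As a worked illustration with made-up values (not taken from the assignment's dataset), consider a column holding the three values 50, 20, and 30:

\[ \text{mean}(x) = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{50 + 20 + 30}{3} = \frac{100}{3} \approx 33.33 \]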

16. Fill-in-the-blank: 
For a loaded dataset, we can refer to this dataset as a _______ _________
________ , and the second list being the list of column values for a given row.
--->
The first list is a list of observations or rows, and the second list is the list of column values for a given row.
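The two levels of the structure can be seen by indexing into a small made-up dataset. The variable names below are chosen for illustration only.

```python
# A dataset stored as a list of rows, where each row is a list of column values.
dataset = [[50, 30], [20, 90]]

first_row = dataset[0]                       # outer index selects a row: [50, 30]
value = dataset[0][1]                        # inner index selects a column value: 30
num_rows = len(dataset)                      # 2 rows
num_cols = len(dataset[0])                   # 2 columns
second_column = [row[1] for row in dataset]  # one column across all rows: [30, 90]

print(first_row, value, num_rows, num_cols, second_column)
```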


17. TRUE or FALSE: 
If your data is not a Gaussian distribution, consider normalizing it only after
applying your machine learning algorithm.
--->
FALSE 

18. List at least five functions listed in this section and explain what they are used for.
-->
(i). The following Python code defines a dataset_minmax() function that takes a dataset argument and returns the minimum and maximum values for each column (variable) in the dataset.

# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

(ii). The following commands create a sample dataset as a list of lists and print it.

dataset = [[50, 30], [20, 90]]
print(dataset)

(iii). The following command imports the reader() function from the csv module (a Python module is comparable to a package in R).

# Example of normalizing the diabetes dataset
from csv import reader

(iv). The following code defines a load_csv() function that takes a filename argument. It reads the file row by row and skips any empty rows.

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset
    
(v). The following commands read the pima-indians-diabetes dataset with the load_csv() function defined above.

# Load pima-indians-diabetes dataset
filename = 'pima-indians-diabetes.csv'
dataset = load_csv(filename)
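Values read by reader() arrive as strings, so they would likely need converting to numbers before the minimum and maximum are computed. The sketch below illustrates one way to do that. The helper name `str_column_to_float` is hypothetical, and two made-up rows held in an in-memory `StringIO` object stand in for the real file so the example runs without it.

```python
from csv import reader
from io import StringIO

def str_column_to_float(dataset, column):
    """Convert one column of string values to floats, in place (hypothetical helper)."""
    for row in dataset:
        row[column] = float(row[column].strip())

def dataset_minmax(dataset):
    """Return a [min, max] pair for each column of the dataset."""
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        minmax.append([min(col_values), max(col_values)])
    return minmax

# Two made-up rows stand in for the contents of the real CSV file.
sample_csv = StringIO("50,30\n20,90\n")
dataset = [row for row in reader(sample_csv) if row]  # skip empty rows

# Convert every column from strings to floats.
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)

minmax = dataset_minmax(dataset)
print(dataset)  # [[50.0, 30.0], [20.0, 90.0]]
print(minmax)   # [[20.0, 50.0], [30.0, 90.0]]
```

Without the conversion, min() and max() would compare the values as text, so "90" would rank above "100".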