Assignment 1A (Chapter 2)
Machine Learning 1 – Assignment 1A Spring 2026
• Available Mar 16 at 12am - Mar 21 at 11:59pm
Directions: Complete the following exercises.
1. When preprocessing your data in either R or Python, what are some of the transforms used to clean the data? (List at least three.)
--->
Standardization, normalization, exponential and power transforms, log transformation, square root transformation, and Box-Cox transformation.
2. What is normalization? Explain.
--->
Normalization is a preprocessing technique used to rescale numerical data. The term
refers to different techniques depending on context; here it is used
to refer to rescaling an input variable to the range between 0 and 1, also called min-max normalization (rescaling).
\[ scaled(x_i) = \frac{x_i - x_{min}}{x_{max} - x_{min}} \]
3. What is the function dataset_minmax() used for?
--->
The dataset_minmax() function calculates the minimum and maximum value for each attribute (column) in the dataset, then returns a list of these minimum and maximum values, one [min, max] pair per column.
4. Fill-in-the-blank:
The __________ and __________ values can be estimated from the training data or specified directly.
--->
The minimum and maximum values for each column can be estimated from training data or specified directly if you have deep knowledge of the problem domain.
5. Fill-in-the-blank:
The ___________ format in Listing 2.3 (P.10) is the output of calculating the min and max values.
--->
The dataset is printed in a list-of-lists format, then the min and max for each column are printed in the format column1: min,max and column2: min,max.
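The output format described above can be reproduced with a small sketch. It assumes the same two-row sample dataset used later in this assignment; the exact print layout is an illustration, not a copy of Listing 2.3.

```python
# Find the min and max values for each column of a list-of-lists dataset
def dataset_minmax(dataset):
    minmax = []
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        minmax.append([min(col_values), max(col_values)])
    return minmax

dataset = [[50, 30], [20, 90]]
minmax = dataset_minmax(dataset)

# Print the raw dataset (list of lists), then one min/max pair per column
print(dataset)
for i, (col_min, col_max) in enumerate(minmax, start=1):
    print(f"column{i}: min={col_min}, max={col_max}")
```

For this sample, column 1 has min 20 and max 50, and column 2 has min 30 and max 90.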
6. TRUE or FALSE: Once we have estimates of the max and min allowed values for each column, we can now normalize the raw data to the range 0 to 1.
--->
TRUE
7. What is a scaled value used for?
--->
Scaled values are useful when covariates are measured on different scales. Putting them on a common scale can improve model performance.
8. Which function is implemented to normalize values in each column of a provided dataset?
--->
\[ scaled(x_i) = \frac{x_i - x_{min}}{x_{max} - x_{min}} \]
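A minimal sketch of how the min-max formula could be applied to every column of a dataset. The function name normalize_dataset and the handling of constant columns are assumptions made for illustration; the min/max pairs are written out by hand for the two-row sample dataset.

```python
# Rescale every column of a list-of-lists dataset to the range 0 to 1.
# normalize_dataset is an illustrative name, not confirmed by the assignment text.
def normalize_dataset(dataset, minmax):
    normalized = []
    for row in dataset:
        new_row = []
        for value, (col_min, col_max) in zip(row, minmax):
            if col_max == col_min:
                # Assumption: a constant column maps to 0.0 to avoid dividing by zero
                new_row.append(0.0)
            else:
                new_row.append((value - col_min) / (col_max - col_min))
        normalized.append(new_row)
    return normalized

dataset = [[50, 30], [20, 90]]
minmax = [[20, 50], [30, 90]]  # [min, max] for each column of the sample dataset
normalized = normalize_dataset(dataset, minmax)
print(normalized)
```

The sketch returns a new list instead of modifying the dataset in place, so the raw values stay available for comparison.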
9. Fill-in-the-blank:
The ________ and __________ for each column are estimated from the
dataset ...
--->
Minimum and Maximum
Training dataset
10. What are the dimensions of our selected Pima-Indians-Diabetes – dataset?
--->
11. Fill-in-the-blank:
The ______ ______ ______ ________ ______ is printed before and after
normalization, showing the effect of scaling.
--->
The first record from the dataset is printed before and after normalization, showing the effect of scaling.
12. Define and state standardization.
--->
Standardization is a rescaling technique that centers the distribution of the data on the value 0 and scales the standard deviation to the value 1.
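As a rough illustration of this definition, the sketch below standardizes one column by subtracting the mean and dividing by the standard deviation. The function name standardize_column, the sample values, and the use of the sample standard deviation (n - 1 in the denominator) are assumptions.

```python
from statistics import mean, stdev

# Standardize one column: subtract the mean, divide by the standard deviation.
# Assumption: stdev() is the sample standard deviation (divides by n - 1).
def standardize_column(values):
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]

column = [50, 30, 70]  # made-up sample values
standardized = standardize_column(column)
print(standardized)
```

After standardization the column has mean 0 and standard deviation 1, which can be checked by calling mean() and stdev() on the result.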
13. What is another name for the normal distribution?
--->
Gaussian distribution
14. We can estimate the mean and standard deviation from which part of the dataset?
--->
From the training data, or the values can be specified using expert knowledge.
15. The mean for a column is calculated as ___________________________ divided by the total number of values.
--->
The mean for a column is calculated as the sum of all values for a column divided by the total number of values.
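The calculation can be sketched for a list-of-lists dataset as follows. The function name column_means and the sample rows are assumptions chosen for illustration.

```python
# Mean of each column: sum of the column's values divided by the number of rows
def column_means(dataset):
    n_rows = len(dataset)
    n_cols = len(dataset[0])
    return [sum(row[i] for row in dataset) / n_rows for i in range(n_cols)]

dataset = [[50, 30], [20, 90], [35, 60]]  # made-up sample rows
means = column_means(dataset)
print(means)
```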
16. Fill-in-the-blank:
For a loaded dataset, we can refer to this dataset as a _______ _________
________ , and the second list being the list of column values for a given row.
--->
The first list is a list of observations or rows, and the second list is the list of column values for a given row.
17. TRUE or FALSE:
If your data is not a Gaussian distribution, consider normalizing it only after
applying your machine learning algorithm.
--->
FALSE
18. List at least five functions listed in this section and explain what they are used for.
--->
(i). The following Python code defines a dataset_minmax() function that takes a dataset argument and returns the minimum and maximum values for each column (variable) in the dataset.
# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax
(ii). The following commands create a sample dataset as a list of lists and print it.
dataset = [[50, 30], [20, 90]]
print(dataset)
(iii). The following command imports the reader() function from the csv module (a module is similar to a package in R).
# Example of normalizing the diabetes dataset
from csv import reader
(iv). The following code defines a load_csv() function that takes a filename argument. The function reads the file row by row and skips any empty rows.
# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset
(v). The following commands read the pima-indians-diabetes dataset with the load_csv() function defined above.
# Load pima-indians-diabetes dataset
filename = 'pima-indians-diabetes.csv'
dataset = load_csv(filename)
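One detail worth noting: csv.reader returns every value as a string, so the loaded rows usually need converting to numbers before min/max or mean calculations are meaningful. The sketch below shows one possible conversion. It uses a made-up in-memory sample in place of the real file so that it runs on its own.

```python
from csv import reader
from io import StringIO

# Made-up in-memory CSV text standing in for a real file (includes one empty row)
sample = StringIO("1,2.5,3\n\n4,5,6.5\n")

# Read the rows and skip empty ones, mirroring the load_csv() logic
dataset = [row for row in reader(sample) if row]

# Convert every value from string to float
dataset = [[float(value.strip()) for value in row] for row in dataset]
print(dataset)
```

The same conversion could be applied to the list returned by load_csv(), assuming every column in the file is numeric.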