Data mining is the process of discovering useful patterns and relationships in data. We will cover a few types of data mining problems:
Classification
Prediction
Clustering
The first two are called supervised learning methods and the third is an unsupervised learning method.
Generally, there are 4 steps in a data mining problem:
Problem definition: Define the problem clearly and relate it to one or more objectives. Describe the deliverables so that the team can focus on delivering the solution and so that clients interested in the outcome of the project have correct expectations. The project team should ideally be multidisciplinary and have a leader. A plan for the project should be developed with a timeline and a budget. A cost-benefit analysis can form the basis for a go/no-go decision on the project.
Data preparation: The quality of the data is the most important factor influencing the quality of the analysis results. Data preparation may include removing potential errors, eliminating variables that are highly correlated with other variables, re-scaling data, and partitioning the data into smaller sets. All the steps in data preparation should be documented.
Implementation of analysis: This includes (1) summarizing the data with tables, graphs, and descriptive statistics and (2) training models with the training data, selecting a model with the validation data, and evaluating the selected model with the test data.
Deployment: Outline the analysis and results. Train practitioners to use and interpret the results. Performance should be measured, monitored, and documented. A plan for updating the model should also be developed.
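The data preparation step mentions eliminating variables that are highly correlated with other variables. The following is a minimal base-R sketch of one way this could be done. The helper name dropCorrelated() and the cutoff of 0.9 are assumptions made for illustration; they do not come from these notes or from any package.

```r
# Hypothetical helper (an illustration, not from the notes): drop every numeric
# column whose absolute correlation with a column to its left exceeds `cutoff`.
dropCorrelated <- function(df, cutoff = 0.9) {
  cm <- abs(cor(df))
  # Zero out the upper triangle and the diagonal so that each pair is checked
  # once: row i now holds the correlations of column i with earlier columns only
  cm[upper.tri(cm, diag = TRUE)] <- 0
  drop <- apply(cm, 1, function(r) any(r > cutoff))
  df[, !drop, drop = FALSE]
}

# Example on the four numeric columns of iris
reduced <- dropCorrelated(iris[, 1:4], cutoff = 0.9)
names(reduced)
```

On iris this drops Petal.Width, whose correlation with Petal.Length is about 0.96. The rule is deliberately simple (it always keeps the earlier column of a pair), so treat it as a starting point rather than a recommended procedure.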
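library(janitor)  # remove_empty(), remove_constant()
library(magrittr) # the pipe operator %>%
# A data frame that contains an empty column (x3)
df <- data.frame(
  x1 = c(10, 12, NA),
  x2 = c(24, 15, 8),
  x3 = c(NA, NA, NA)
)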
# display df
df
## x1 x2 x3
## 1 10 24 NA
## 2 12 15 NA
## 3 NA 8 NA
# Remove empty rows and columns
df %>% remove_empty(which = c("rows", "cols"))
## x1 x2
## 1 10 24
## 2 12 15
## 3 NA 8
# A data frame that contains a constant column (x3)
df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c(4, 5, 6),
  x3 = c(7, 7, 7)
)
# Remove constant
df %>% remove_constant()
## x1 x2
## 1 1 4
## 2 2 5
## 3 3 6
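To make the two cleaning steps above less of a black box, here is a base-R sketch of roughly the same checks: a row or column counts as empty when every entry is NA, and a column counts as constant when it holds one distinct value. This is an approximation for illustration; the janitor functions may differ in details such as how NA is treated in constant columns.

```r
# Same toy data as above
df <- data.frame(x1 = c(10, 12, NA), x2 = c(24, 15, 8), x3 = c(NA, NA, NA))

# A row or column is "empty" when it has no non-missing entries
empty.rows <- rowSums(!is.na(df)) == 0
empty.cols <- colSums(!is.na(df)) == 0
no.empty <- df[!empty.rows, !empty.cols, drop = FALSE]

df2 <- data.frame(x1 = c(1, 2, 3), x2 = c(4, 5, 6), x3 = c(7, 7, 7))

# A column is constant when it contains a single distinct value
constant.cols <- sapply(df2, function(col) length(unique(col)) == 1)
no.constant <- df2[, !constant.cols, drop = FALSE]

no.empty
no.constant
```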
library(data.validator)
# Create an object for storing validation results
report <- data_validation_report()
# Create a function for checking whether a value is in a vector
is.in = function(v){
  function(x){return(x %in% v)}
}
# Create a function to check if a value is between two given values.
between = function(a, b){
  function(x) {
    return(a <= x & x <= b)
  }
}
# Create a function to check if a value is numeric.
is.Numeric = function(){
  function(x){return(!is.na(as.numeric(x)))}
}
# Data to be validated
D = data.frame(student = 1:5,
               height = c(169, 180, "177 cm", "49cm", 192),
               gender = c("M", "M", "Female", "male", "F"),
               free.throw.rate = c(0.78, 0.90, 1.33, -0.23, 0.55)
)
# Run the three checks on D and add the results to the report
validate(D, name = "Verifying student dataset") %>%
  validate_cols(is.Numeric(), height, description = "height is numeric") %>%
  validate_cols(between(0, 1), free.throw.rate, description = "free throw rate is between 0 and 1") %>%
  validate_cols(is.in(c("M", "F")), gender, description = "gender is either M or F") %>%
  add_results(report)
## Warning in improper.predicate(x): NAs introduced by coercion
## Warning in improper.predicate(x): NAs introduced by coercion
print(report)
## Validation summary:
## Number of successful validations: 0
## Number of failed validations: 3
## Number of validations with warnings: 0
##
## Advanced view:
##
##
## |table_name |description |type | total_violations|
## |:-------------------------|:----------------------------------|:-----|----------------:|
## |Verifying student dataset |free throw rate is between 0 and 1 |error | 2|
## |Verifying student dataset |gender is either M or F |error | 2|
## |Verifying student dataset |height is numeric |error | 2|
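The violation counts in the report can be cross-checked by hand. The sketch below applies the same three rules in base R, without the validation package, and counts the rows that break each rule. The object name violations is an assumption introduced for this illustration.

```r
# Same data as in the validation example
D <- data.frame(student = 1:5,
                height = c(169, 180, "177 cm", "49cm", 192),
                gender = c("M", "M", "Female", "male", "F"),
                free.throw.rate = c(0.78, 0.90, 1.33, -0.23, 0.55))

violations <- c(
  # height: values that cannot be converted to a number
  height = sum(is.na(suppressWarnings(as.numeric(D$height)))),
  # free throw rate: values outside the interval [0, 1]
  free.throw.rate = sum(!(D$free.throw.rate >= 0 & D$free.throw.rate <= 1)),
  # gender: values other than "M" or "F"
  gender = sum(!(D$gender %in% c("M", "F")))
)
violations
```

Each rule is violated by two rows, which agrees with the total_violations column of the report.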
We will use the data frame “nhanes” from the mice package for demonstration.
library(mice)
# Display the data
nhanes
## age bmi hyp chl
## 1 1 NA NA NA
## 2 2 22.7 1 187
## 3 1 NA 1 187
## 4 3 NA NA NA
## 5 1 20.4 1 113
## 6 3 NA NA 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 NA NA NA
## 11 1 NA NA NA
## 12 2 NA NA NA
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 NA
## 16 1 NA NA NA
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 NA
## 21 1 NA NA NA
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 NA
## 25 2 27.4 1 186
# Display missingness
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(nhanes)
In the missingness map, the variable names of the data frame are on the x-axis and the row numbers are on the y-axis.
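A missingness map is easier to read alongside a numeric summary. The following base-R sketch counts missing values per variable and the number of complete rows. It uses the airquality data set, which ships with base R, purely as a stand-in so that the example runs without extra packages; the same calls should work on nhanes.

```r
# airquality is a stand-in data set (an assumption for this illustration)
missing.count <- colSums(is.na(airquality))                 # missing values per column
missing.pct <- round(100 * colMeans(is.na(airquality)), 1)  # same, as a percentage
data.frame(missing = missing.count, percent = missing.pct)

# Number of rows with no missing value in any column
complete.rows <- sum(complete.cases(airquality))
complete.rows
```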
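Imputing missing values in a data frame takes two steps: mice() generates the imputed values, and complete() fills them into the data frame.
# Impute with mice(), then extract a completed data frame with complete()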
nhanes %>% mice() %>% complete()
##
## iter imp variable
## 1 1 bmi hyp chl
## 1 2 bmi hyp chl
## 1 3 bmi hyp chl
## 1 4 bmi hyp chl
## 1 5 bmi hyp chl
## 2 1 bmi hyp chl
## 2 2 bmi hyp chl
## 2 3 bmi hyp chl
## 2 4 bmi hyp chl
## 2 5 bmi hyp chl
## 3 1 bmi hyp chl
## 3 2 bmi hyp chl
## 3 3 bmi hyp chl
## 3 4 bmi hyp chl
## 3 5 bmi hyp chl
## 4 1 bmi hyp chl
## 4 2 bmi hyp chl
## 4 3 bmi hyp chl
## 4 4 bmi hyp chl
## 4 5 bmi hyp chl
## 5 1 bmi hyp chl
## 5 2 bmi hyp chl
## 5 3 bmi hyp chl
## 5 4 bmi hyp chl
## 5 5 bmi hyp chl
## age bmi hyp chl
## 1 1 30.1 1 199
## 2 2 22.7 1 187
## 3 1 27.2 1 187
## 4 3 35.3 2 284
## 5 1 20.4 1 113
## 6 3 22.5 1 184
## 7 1 22.5 1 118
## 8 1 30.1 1 187
## 9 2 22.0 1 238
## 10 2 25.5 2 238
## 11 1 21.7 1 238
## 12 2 25.5 1 206
## 13 3 21.7 1 206
## 14 2 28.7 2 204
## 15 1 29.6 1 229
## 16 1 30.1 1 229
## 17 3 27.2 2 284
## 18 2 26.3 2 199
## 19 1 35.3 1 218
## 20 3 25.5 2 184
## 21 1 22.0 1 238
## 22 1 33.2 1 229
## 23 1 27.5 1 131
## 24 3 24.9 1 184
## 25 2 27.4 1 186
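Because the imputed values are drawn at random, the completed data frame changes from run to run. The sketch below, which assumes the mice package is installed, fixes the random seed, hides the iteration log, and shows that mice() actually stores several imputed versions of the data (five by default), of which complete() returns the first unless asked otherwise. The seed value 123 is arbitrary.

```r
library(mice)

# seed makes the imputations reproducible; printFlag = FALSE hides the log
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# The first of the five completed data frames
first <- mice::complete(imp, 1)

# All five completed data frames stacked on top of each other
all.long <- mice::complete(imp, action = "long")

anyNA(first)
nrow(all.long)
```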
In supervised learning, we need to partition the data into three sets, usually in the ratio 60%:20%:20%:
training set, used to train/fit models,
validation set, used to choose the best model (apply each trained model to the validation set and choose the model with the best performance on some criterion), and
test set, on which the performance of the best model is reported.
partitionData = function(df, prop = c(0.60, 0.20, 0.20)){
  n = nrow(df)
  # Shuffle the row numbers
  idx = sample(1:n, n)
  # Sizes of the training and validation sets; the remaining rows form the test set
  n1 = round(n*prop[1])
  n2 = round(n*prop[2])
  train.idx = idx[1:n1]
  valid.idx = idx[(n1+1):(n1+n2)]
  test.idx = idx[-c(1:(n1+n2))]
  train = df[train.idx, ]
  valid = df[valid.idx, ]
  test = df[test.idx, ]
  # With only two proportions, return just the training and validation sets
  if (length(prop) == 2){
    list(train = train, valid = valid)
  } else{
    list(train = train, valid = valid, test = test)
  }
}
partitionData(iris)
## $train
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 82 5.5 2.4 3.7 1.0 versicolor
## 31 4.8 3.1 1.6 0.2 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 58 4.9 2.4 3.3 1.0 versicolor
## 104 6.3 2.9 5.6 1.8 virginica
## 77 6.8 2.8 4.8 1.4 versicolor
## 7 4.6 3.4 1.4 0.3 setosa
## 115 5.8 2.8 5.1 2.4 virginica
## 3 4.7 3.2 1.3 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 101 6.3 3.3 6.0 2.5 virginica
## 150 5.9 3.0 5.1 1.8 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 6 5.4 3.9 1.7 0.4 setosa
## 148 6.5 3.0 5.2 2.0 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 99 5.1 2.5 3.0 1.1 versicolor
## 48 4.6 3.2 1.4 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 53 6.9 3.1 4.9 1.5 versicolor
## 4 4.6 3.1 1.5 0.2 setosa
## 73 6.3 2.5 4.9 1.5 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 32 5.4 3.4 1.5 0.4 setosa
## 90 5.5 2.5 4.0 1.3 versicolor
## 12 4.8 3.4 1.6 0.2 setosa
## 113 6.8 3.0 5.5 2.1 virginica
## 40 5.1 3.4 1.5 0.2 setosa
## 76 6.6 3.0 4.4 1.4 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 9 4.4 2.9 1.4 0.2 setosa
## 102 5.8 2.7 5.1 1.9 virginica
## 33 5.2 4.1 1.5 0.1 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 37 5.5 3.5 1.3 0.2 setosa
## 114 5.7 2.5 5.0 2.0 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 95 5.6 2.7 4.2 1.3 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 27 5.0 3.4 1.6 0.4 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 67 5.6 3.0 4.5 1.5 versicolor
## 118 7.7 3.8 6.7 2.2 virginica
## 46 4.8 3.0 1.4 0.3 setosa
## 94 5.0 2.3 3.3 1.0 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 138 6.4 3.1 5.5 1.8 virginica
## 81 5.5 2.4 3.8 1.1 versicolor
## 105 6.5 3.0 5.8 2.2 virginica
## 49 5.3 3.7 1.5 0.2 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 80 5.7 2.6 3.5 1.0 versicolor
## 139 6.0 3.0 4.8 1.8 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 84 6.0 2.7 5.1 1.6 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 39 4.4 3.0 1.3 0.2 setosa
## 116 6.4 3.2 5.3 2.3 virginica
## 100 5.7 2.8 4.1 1.3 versicolor
## 41 5.0 3.5 1.3 0.3 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 61 5.0 2.0 3.5 1.0 versicolor
## 149 6.2 3.4 5.4 2.3 virginica
## 71 5.9 3.2 4.8 1.8 versicolor
## 145 6.7 3.3 5.7 2.5 virginica
## 1 5.1 3.5 1.4 0.2 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 88 6.3 2.3 4.4 1.3 versicolor
## 11 5.4 3.7 1.5 0.2 setosa
## 72 6.1 2.8 4.0 1.3 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 141 6.7 3.1 5.6 2.4 virginica
## 86 6.0 3.4 4.5 1.6 versicolor
## 132 7.9 3.8 6.4 2.0 virginica
## 52 6.4 3.2 4.5 1.5 versicolor
## 147 6.3 2.5 5.0 1.9 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 64 6.1 2.9 4.7 1.4 versicolor
## 2 4.9 3.0 1.4 0.2 setosa
##
## $valid
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 97 5.7 2.9 4.2 1.3 versicolor
## 126 7.2 3.2 6.0 1.8 virginica
## 79 6.0 2.9 4.5 1.5 versicolor
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 63 6.0 2.2 4.0 1.0 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 34 5.5 4.2 1.4 0.2 setosa
## 119 7.7 2.6 6.9 2.3 virginica
## 70 5.6 2.5 3.9 1.1 versicolor
## 129 6.4 2.8 5.6 2.1 virginica
## 89 5.6 3.0 4.1 1.3 versicolor
## 50 5.0 3.3 1.4 0.2 setosa
## 133 6.4 2.8 5.6 2.2 virginica
## 10 4.9 3.1 1.5 0.1 setosa
## 134 6.3 2.8 5.1 1.5 virginica
## 16 5.7 4.4 1.5 0.4 setosa
## 68 5.8 2.7 4.1 1.0 versicolor
## 22 5.1 3.7 1.5 0.4 setosa
## 130 7.2 3.0 5.8 1.6 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 13 4.8 3.0 1.4 0.1 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 55 6.5 2.8 4.6 1.5 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 144 6.8 3.2 5.9 2.3 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 69 6.2 2.2 4.5 1.5 versicolor
##
## $test
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 127 6.2 2.8 4.8 1.8 virginica
## 51 7.0 3.2 4.7 1.4 versicolor
## 137 6.3 3.4 5.6 2.4 virginica
## 74 6.1 2.8 4.7 1.2 versicolor
## 136 7.7 3.0 6.1 2.3 virginica
## 92 6.1 3.0 4.6 1.4 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 45 5.1 3.8 1.9 0.4 setosa
## 60 5.2 2.7 3.9 1.4 versicolor
## 14 4.3 3.0 1.1 0.1 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 106 7.6 3.0 6.6 2.1 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 43 4.4 3.2 1.3 0.2 setosa
## 128 6.1 3.0 4.9 1.8 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 8 5.0 3.4 1.5 0.2 setosa
## 78 6.7 3.0 5.0 1.7 versicolor
## 140 6.9 3.1 5.4 2.1 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 87 6.7 3.1 4.7 1.5 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 5 5.0 3.6 1.4 0.2 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 62 5.9 3.0 4.2 1.5 versicolor
## 15 5.8 4.0 1.2 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 96 5.7 3.0 4.2 1.2 versicolor
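Printing all three sets is rarely necessary. A quicker check, sketched below in base R, is to fix the random seed so that the split is reproducible and then confirm the set sizes and that no row appears in more than one set. The seed value and the object names are assumptions made for this illustration.

```r
set.seed(2024)             # arbitrary seed, so the same split is obtained every run
n <- nrow(iris)
idx <- sample(n)           # shuffled row numbers
n.train <- round(0.60 * n)
n.valid <- round(0.20 * n)

train <- iris[idx[seq_len(n.train)], ]
valid <- iris[idx[n.train + seq_len(n.valid)], ]
test  <- iris[idx[-seq_len(n.train + n.valid)], ]

# Sizes of the three sets
sizes <- sapply(list(train = train, valid = valid, test = test), nrow)
sizes

# Row names that occur in more than one set (there should be none)
overlap <- anyDuplicated(c(rownames(train), rownames(valid), rownames(test)))
overlap
```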
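If you fit only one model, there is nothing to choose between, so a separate validation set is not needed. The data are then split into two sets only, and the held-out set is called the validation set instead of the test set. This is only a change in naming.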
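The roles of the three sets can be sketched with a small example. The code below fits two candidate linear models on a training set, picks the one with the lower error on the validation set, and reports the error of that model on the test set. The data set (mtcars), the two formulas, the split sizes, and the use of RMSE as the criterion are all assumptions chosen for illustration.

```r
set.seed(1)
n <- nrow(mtcars)
idx <- sample(n)
train <- mtcars[idx[1:19], ]    # about 60%
valid <- mtcars[idx[20:26], ]   # about 20%
test  <- mtcars[idx[27:32], ]   # about 20%

# Root mean squared error of a fitted model on a data set
rmse <- function(model, data) sqrt(mean((data$mpg - predict(model, data))^2))

# Step 1: train every candidate model on the training set
models <- list(
  m1 = lm(mpg ~ wt, data = train),
  m2 = lm(mpg ~ wt + hp, data = train)
)

# Step 2: choose the model with the lowest error on the validation set
valid.rmse <- sapply(models, rmse, data = valid)
best <- names(which.min(valid.rmse))

# Step 3: report the performance of the chosen model on the test set
test.rmse <- rmse(models[[best]], test)
valid.rmse
best
test.rmse
```

The test set is touched only once, at the very end, so its error is not influenced by the model selection step.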
We can use the function createFolds() from the caret (Classification And REgression Training) package to randomly partition a vector into k folds of about equal size.
The following code partitions the vector 1 to 20 randomly into 3 folds.
library(caret)
createFolds(1:20, 3)
## $Fold1
## [1] 2 7 11 14 16 20
##
## $Fold2
## [1] 1 4 6 8 12 13 17
##
## $Fold3
## [1] 3 5 9 10 15 18 19
The following code partitions the mtcars data frame randomly into 3 data sets.
n = nrow(mtcars)
# Partition the indices (i.e., row numbers) first. The 3 sets of indices are stored in a list (named I here).
I = createFolds(1:n, 3)
# Print I
I
## $Fold1
## [1] 2 6 7 11 14 15 20 22 24 26 29 32
##
## $Fold2
## [1] 1 4 5 9 10 17 23 28 30 31
##
## $Fold3
## [1] 3 8 12 13 16 18 19 21 25 27
# get the 3 datasets
set1 = mtcars[I$Fold1, ]
set2 = mtcars[I$Fold2, ]
set3 = mtcars[I$Fold3, ]
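For readers who want to see the idea without the caret package, the following base-R sketch builds three random folds of nearly equal size by shuffling a vector of fold labels. It is only a rough analogue: it does not reproduce any stratification that createFolds() applies, and the object names are assumptions for this illustration.

```r
set.seed(1)
n <- nrow(mtcars)
k <- 3

# Assign each row a fold label 1..k, then shuffle the labels
fold.id <- sample(rep(1:k, length.out = n))

# Row numbers belonging to each fold
folds <- split(1:n, fold.id)

# The k data sets
sets <- lapply(folds, function(rows) mtcars[rows, ])
sapply(sets, nrow)
```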
Question: How can you use the createFolds() function to partition the mtcars data frame into training (70%), validation (15%), and test (15%)?
A note: The textbook often partitions a data frame into training and validation sets only. We will follow this practice. In this situation, you can use the createDataPartition() function instead of createFolds() to create the indices of the training set; the remaining rows form the validation set. The syntax is
createDataPartition(1:nrow(YourDataFrame), p = 0.7)
where p is the proportion of data going to the training set.
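As a sketch of how the returned indices are typically used, the code below splits mtcars into a training set and a validation set. It assumes the caret package is installed. Passing the outcome column (mpg) as the first argument and setting list = FALSE are choices made for this illustration; list = FALSE returns the indices as a matrix rather than a list, which makes row indexing convenient.

```r
library(caret)

set.seed(1)   # arbitrary seed for a reproducible split
# Indices of the rows that go to the training set (about 70% of the data)
train.idx <- createDataPartition(mtcars$mpg, p = 0.7, list = FALSE)

train <- mtcars[train.idx, ]    # rows selected for training
valid <- mtcars[-train.idx, ]   # all remaining rows

c(train = nrow(train), valid = nrow(valid))
```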
In machine learning, re-scaling numerical variables (also called features) that are on different scales is a good idea when fitting models such as kNN and neural networks.
Re-scaling can speed up the convergence of the fitting algorithm and can improve the validation performance of the model being used.
For a numeric variable, say x, commonly used re-scaling methods include
standardization = (x - mean)/(standard deviation), also called the z-score
normalization = (x - min)/(max - min), also called the range method
The former (z-score method) is suggested when your model requires features to have normal distributions. When there is no such requirement, use the latter (range method).
D = iris
process <- preProcess(D, method="range") # use c("center", "scale") for standardization
normalized.D <- predict(process, D)
normalized.D
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 0.22222222 0.62500000 0.06779661 0.04166667 setosa
## 2 0.16666667 0.41666667 0.06779661 0.04166667 setosa
## 3 0.11111111 0.50000000 0.05084746 0.04166667 setosa
## 4 0.08333333 0.45833333 0.08474576 0.04166667 setosa
## 5 0.19444444 0.66666667 0.06779661 0.04166667 setosa
## 6 0.30555556 0.79166667 0.11864407 0.12500000 setosa
## 7 0.08333333 0.58333333 0.06779661 0.08333333 setosa
## 8 0.19444444 0.58333333 0.08474576 0.04166667 setosa
## 9 0.02777778 0.37500000 0.06779661 0.04166667 setosa
## 10 0.16666667 0.45833333 0.08474576 0.00000000 setosa
## 11 0.30555556 0.70833333 0.08474576 0.04166667 setosa
## 12 0.13888889 0.58333333 0.10169492 0.04166667 setosa
## 13 0.13888889 0.41666667 0.06779661 0.00000000 setosa
## 14 0.00000000 0.41666667 0.01694915 0.00000000 setosa
## 15 0.41666667 0.83333333 0.03389831 0.04166667 setosa
## 16 0.38888889 1.00000000 0.08474576 0.12500000 setosa
## 17 0.30555556 0.79166667 0.05084746 0.12500000 setosa
## 18 0.22222222 0.62500000 0.06779661 0.08333333 setosa
## 19 0.38888889 0.75000000 0.11864407 0.08333333 setosa
## 20 0.22222222 0.75000000 0.08474576 0.08333333 setosa
## 21 0.30555556 0.58333333 0.11864407 0.04166667 setosa
## 22 0.22222222 0.70833333 0.08474576 0.12500000 setosa
## 23 0.08333333 0.66666667 0.00000000 0.04166667 setosa
## 24 0.22222222 0.54166667 0.11864407 0.16666667 setosa
## 25 0.13888889 0.58333333 0.15254237 0.04166667 setosa
## 26 0.19444444 0.41666667 0.10169492 0.04166667 setosa
## 27 0.19444444 0.58333333 0.10169492 0.12500000 setosa
## 28 0.25000000 0.62500000 0.08474576 0.04166667 setosa
## 29 0.25000000 0.58333333 0.06779661 0.04166667 setosa
## 30 0.11111111 0.50000000 0.10169492 0.04166667 setosa
## 31 0.13888889 0.45833333 0.10169492 0.04166667 setosa
## 32 0.30555556 0.58333333 0.08474576 0.12500000 setosa
## 33 0.25000000 0.87500000 0.08474576 0.00000000 setosa
## 34 0.33333333 0.91666667 0.06779661 0.04166667 setosa
## 35 0.16666667 0.45833333 0.08474576 0.04166667 setosa
## 36 0.19444444 0.50000000 0.03389831 0.04166667 setosa
## 37 0.33333333 0.62500000 0.05084746 0.04166667 setosa
## 38 0.16666667 0.66666667 0.06779661 0.00000000 setosa
## 39 0.02777778 0.41666667 0.05084746 0.04166667 setosa
## 40 0.22222222 0.58333333 0.08474576 0.04166667 setosa
## 41 0.19444444 0.62500000 0.05084746 0.08333333 setosa
## 42 0.05555556 0.12500000 0.05084746 0.08333333 setosa
## 43 0.02777778 0.50000000 0.05084746 0.04166667 setosa
## 44 0.19444444 0.62500000 0.10169492 0.20833333 setosa
## 45 0.22222222 0.75000000 0.15254237 0.12500000 setosa
## 46 0.13888889 0.41666667 0.06779661 0.08333333 setosa
## 47 0.22222222 0.75000000 0.10169492 0.04166667 setosa
## 48 0.08333333 0.50000000 0.06779661 0.04166667 setosa
## 49 0.27777778 0.70833333 0.08474576 0.04166667 setosa
## 50 0.19444444 0.54166667 0.06779661 0.04166667 setosa
## 51 0.75000000 0.50000000 0.62711864 0.54166667 versicolor
## 52 0.58333333 0.50000000 0.59322034 0.58333333 versicolor
## 53 0.72222222 0.45833333 0.66101695 0.58333333 versicolor
## 54 0.33333333 0.12500000 0.50847458 0.50000000 versicolor
## 55 0.61111111 0.33333333 0.61016949 0.58333333 versicolor
## 56 0.38888889 0.33333333 0.59322034 0.50000000 versicolor
## 57 0.55555556 0.54166667 0.62711864 0.62500000 versicolor
## 58 0.16666667 0.16666667 0.38983051 0.37500000 versicolor
## 59 0.63888889 0.37500000 0.61016949 0.50000000 versicolor
## 60 0.25000000 0.29166667 0.49152542 0.54166667 versicolor
## 61 0.19444444 0.00000000 0.42372881 0.37500000 versicolor
## 62 0.44444444 0.41666667 0.54237288 0.58333333 versicolor
## 63 0.47222222 0.08333333 0.50847458 0.37500000 versicolor
## 64 0.50000000 0.37500000 0.62711864 0.54166667 versicolor
## 65 0.36111111 0.37500000 0.44067797 0.50000000 versicolor
## 66 0.66666667 0.45833333 0.57627119 0.54166667 versicolor
## 67 0.36111111 0.41666667 0.59322034 0.58333333 versicolor
## 68 0.41666667 0.29166667 0.52542373 0.37500000 versicolor
## 69 0.52777778 0.08333333 0.59322034 0.58333333 versicolor
## 70 0.36111111 0.20833333 0.49152542 0.41666667 versicolor
## 71 0.44444444 0.50000000 0.64406780 0.70833333 versicolor
## 72 0.50000000 0.33333333 0.50847458 0.50000000 versicolor
## 73 0.55555556 0.20833333 0.66101695 0.58333333 versicolor
## 74 0.50000000 0.33333333 0.62711864 0.45833333 versicolor
## 75 0.58333333 0.37500000 0.55932203 0.50000000 versicolor
## 76 0.63888889 0.41666667 0.57627119 0.54166667 versicolor
## 77 0.69444444 0.33333333 0.64406780 0.54166667 versicolor
## 78 0.66666667 0.41666667 0.67796610 0.66666667 versicolor
## 79 0.47222222 0.37500000 0.59322034 0.58333333 versicolor
## 80 0.38888889 0.25000000 0.42372881 0.37500000 versicolor
## 81 0.33333333 0.16666667 0.47457627 0.41666667 versicolor
## 82 0.33333333 0.16666667 0.45762712 0.37500000 versicolor
## 83 0.41666667 0.29166667 0.49152542 0.45833333 versicolor
## 84 0.47222222 0.29166667 0.69491525 0.62500000 versicolor
## 85 0.30555556 0.41666667 0.59322034 0.58333333 versicolor
## 86 0.47222222 0.58333333 0.59322034 0.62500000 versicolor
## 87 0.66666667 0.45833333 0.62711864 0.58333333 versicolor
## 88 0.55555556 0.12500000 0.57627119 0.50000000 versicolor
## 89 0.36111111 0.41666667 0.52542373 0.50000000 versicolor
## 90 0.33333333 0.20833333 0.50847458 0.50000000 versicolor
## 91 0.33333333 0.25000000 0.57627119 0.45833333 versicolor
## 92 0.50000000 0.41666667 0.61016949 0.54166667 versicolor
## 93 0.41666667 0.25000000 0.50847458 0.45833333 versicolor
## 94 0.19444444 0.12500000 0.38983051 0.37500000 versicolor
## 95 0.36111111 0.29166667 0.54237288 0.50000000 versicolor
## 96 0.38888889 0.41666667 0.54237288 0.45833333 versicolor
## 97 0.38888889 0.37500000 0.54237288 0.50000000 versicolor
## 98 0.52777778 0.37500000 0.55932203 0.50000000 versicolor
## 99 0.22222222 0.20833333 0.33898305 0.41666667 versicolor
## 100 0.38888889 0.33333333 0.52542373 0.50000000 versicolor
## 101 0.55555556 0.54166667 0.84745763 1.00000000 virginica
## 102 0.41666667 0.29166667 0.69491525 0.75000000 virginica
## 103 0.77777778 0.41666667 0.83050847 0.83333333 virginica
## 104 0.55555556 0.37500000 0.77966102 0.70833333 virginica
## 105 0.61111111 0.41666667 0.81355932 0.87500000 virginica
## 106 0.91666667 0.41666667 0.94915254 0.83333333 virginica
## 107 0.16666667 0.20833333 0.59322034 0.66666667 virginica
## 108 0.83333333 0.37500000 0.89830508 0.70833333 virginica
## 109 0.66666667 0.20833333 0.81355932 0.70833333 virginica
## 110 0.80555556 0.66666667 0.86440678 1.00000000 virginica
## 111 0.61111111 0.50000000 0.69491525 0.79166667 virginica
## 112 0.58333333 0.29166667 0.72881356 0.75000000 virginica
## 113 0.69444444 0.41666667 0.76271186 0.83333333 virginica
## 114 0.38888889 0.20833333 0.67796610 0.79166667 virginica
## 115 0.41666667 0.33333333 0.69491525 0.95833333 virginica
## 116 0.58333333 0.50000000 0.72881356 0.91666667 virginica
## 117 0.61111111 0.41666667 0.76271186 0.70833333 virginica
## 118 0.94444444 0.75000000 0.96610169 0.87500000 virginica
## 119 0.94444444 0.25000000 1.00000000 0.91666667 virginica
## 120 0.47222222 0.08333333 0.67796610 0.58333333 virginica
## 121 0.72222222 0.50000000 0.79661017 0.91666667 virginica
## 122 0.36111111 0.33333333 0.66101695 0.79166667 virginica
## 123 0.94444444 0.33333333 0.96610169 0.79166667 virginica
## 124 0.55555556 0.29166667 0.66101695 0.70833333 virginica
## 125 0.66666667 0.54166667 0.79661017 0.83333333 virginica
## 126 0.80555556 0.50000000 0.84745763 0.70833333 virginica
## 127 0.52777778 0.33333333 0.64406780 0.70833333 virginica
## 128 0.50000000 0.41666667 0.66101695 0.70833333 virginica
## 129 0.58333333 0.33333333 0.77966102 0.83333333 virginica
## 130 0.80555556 0.41666667 0.81355932 0.62500000 virginica
## 131 0.86111111 0.33333333 0.86440678 0.75000000 virginica
## 132 1.00000000 0.75000000 0.91525424 0.79166667 virginica
## 133 0.58333333 0.33333333 0.77966102 0.87500000 virginica
## 134 0.55555556 0.33333333 0.69491525 0.58333333 virginica
## 135 0.50000000 0.25000000 0.77966102 0.54166667 virginica
## 136 0.94444444 0.41666667 0.86440678 0.91666667 virginica
## 137 0.55555556 0.58333333 0.77966102 0.95833333 virginica
## 138 0.58333333 0.45833333 0.76271186 0.70833333 virginica
## 139 0.47222222 0.41666667 0.64406780 0.70833333 virginica
## 140 0.72222222 0.45833333 0.74576271 0.83333333 virginica
## 141 0.66666667 0.45833333 0.77966102 0.95833333 virginica
## 142 0.72222222 0.45833333 0.69491525 0.91666667 virginica
## 143 0.41666667 0.29166667 0.69491525 0.75000000 virginica
## 144 0.69444444 0.50000000 0.83050847 0.91666667 virginica
## 145 0.66666667 0.54166667 0.79661017 1.00000000 virginica
## 146 0.66666667 0.41666667 0.71186441 0.91666667 virginica
## 147 0.55555556 0.20833333 0.67796610 0.75000000 virginica
## 148 0.61111111 0.41666667 0.71186441 0.79166667 virginica
## 149 0.52777778 0.58333333 0.74576271 0.91666667 virginica
## 150 0.44444444 0.41666667 0.69491525 0.70833333 virginica
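The values above can be reproduced directly from the two formulas. The base-R sketch below re-scales Sepal.Length by hand with both methods and compares the z-score version with the built-in scale() function. The object names are assumptions for this illustration.

```r
x <- iris$Sepal.Length

# Range method: (x - min)/(max - min), which maps the values onto [0, 1]
x.range <- (x - min(x)) / (max(x) - min(x))

# z-score method: (x - mean)/(standard deviation)
x.z <- (x - mean(x)) / sd(x)

head(x.range, 3)                           # compare with the first rows above
range(x.range)                             # smallest and largest re-scaled value
all.equal(x.z, as.numeric(scale(x)))       # scale() centers and scales the same way
```

The first re-scaled value is (5.1 - 4.3)/(7.9 - 4.3), about 0.2222, which matches the first row of the Sepal.Length column in the output above.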
Not every model requires data re-scaling. For example, regression and classification tree models do not require re-scaling.
Data reduction: Group observations into clusters, each containing similar individuals. It operates on rows.
Dimension reduction: Reduce the number of features to improve predictive power, manageability, and interpretability. It operates on columns.
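The difference between the two kinds of reduction can be sketched in base R. The example below uses k-means clustering as one possible way to group rows and principal component analysis (PCA) as one possible way to reduce columns. Both techniques, the choice of 3 clusters, and the choice of 2 components are assumptions made for illustration; the notes do not prescribe them.

```r
X <- iris[, 1:4]   # the four numeric features

# Data reduction (rows): group the 150 observations into 3 clusters
set.seed(1)
km <- kmeans(scale(X), centers = 3, nstart = 20)
table(km$cluster)              # number of observations in each cluster

# Dimension reduction (columns): replace the 4 features by 2 principal components
pca <- prcomp(X, scale. = TRUE)
cum.var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)   # cumulative share of variance
scores <- pca$x[, 1:2]                            # 150 rows, 2 columns
cum.var
dim(scores)
```

Here the first two components retain more than 95% of the total variance of the standardized features, so four columns are reduced to two with little loss of information.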