Data Cleaning, Re-scaling, and Partition

Types of Data Mining Problems

Data mining is the process of discovering useful patterns in data. We will cover three types of data mining problems:

Classification
Prediction
Clustering

The first two are called supervised learning methods and the third is an unsupervised learning method.

Generally, there are 4 steps in a data mining problem:

Problem definition: Define the problem clearly and tie it to one or more objectives. Describe the deliverables so that the team stays focused on delivering the solution and clients interested in the outcome have correct expectations. The project team should ideally be multidisciplinary and have a leader. Develop a project plan with a timeline and a budget. A cost-benefit analysis can form the basis for a go/no-go decision on the project.
Data Preparation: Data quality is the most important factor influencing the quality of the analysis results. Data preparation may include removing potential errors, eliminating variables that are highly correlated with others, re-scaling data, and partitioning data into smaller sets. Every data preparation step should be documented.
Implementation of Analysis: This includes (1) summarizing the data with tables, graphs, and descriptive statistics, and (2) training models on the training data, selecting a model with the validation data, and evaluating the selected model on the test data.
Deployment: Outline the analysis and results. Train practitioners to use and interpret the results. Performance should be measured, monitored, and documented. A plan for updating the model should also be developed.

Data Cleaning with the janitor Package

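Removing Empty Rows or Columns

library(janitor)  # provides remove_empty() and remove_constant()
library(dplyr)    # provides the %>% pipe

# A data frame with an empty column (x3)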
df <- data.frame(
  x1 = c(10, 12, NA),
  x2 = c(24, 15, 8),
  x3 = c(NA, NA, NA)
)

# display df
df

##   x1 x2 x3
## 1 10 24 NA
## 2 12 15 NA
## 3 NA  8 NA

# Remove empty rows and columns
df %>% remove_empty(which = c("rows", "cols"))

##   x1 x2
## 1 10 24
## 2 12 15
## 3 NA  8
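
To make the idea of an "empty" row or column concrete, here is a minimal base R sketch that reproduces the result above for this small data frame. It is an illustration of the concept, not janitor's actual implementation; the names keep.rows, keep.cols, and cleaned are made up for this example.

```r
# Same example data frame as above: x3 is entirely NA
df <- data.frame(
  x1 = c(10, 12, NA),
  x2 = c(24, 15, 8),
  x3 = c(NA, NA, NA)
)

# A row or column counts as empty when every one of its values is NA
keep.rows <- rowSums(!is.na(df)) > 0
keep.cols <- colSums(!is.na(df)) > 0

# drop = FALSE keeps the result a data frame even if only one column survives
cleaned <- df[keep.rows, keep.cols, drop = FALSE]
cleaned
```

Row 3 is kept because x2 is observed there; only x3 is dropped.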

Removing Constant Columns

# A data frame with a constant column (x3)
df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c(4, 5, 6),
  x3 = c(7, 7, 7)
)

# Remove constant columns
df %>% remove_constant()

##   x1 x2
## 1  1  4
## 2  2  5
## 3  3  6
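
As a rough base R sketch of the same idea (an illustration only, not how janitor implements it), a column can be treated as constant when it holds a single distinct value. The name is.constant is hypothetical.

```r
# Same example data frame as above: x3 takes a single value
df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c(4, 5, 6),
  x3 = c(7, 7, 7)
)

# Flag columns that contain exactly one distinct value
is.constant <- sapply(df, function(col) length(unique(col)) == 1)

# Keep only the non-constant columns
reduced <- df[, !is.constant, drop = FALSE]
reduced
```

A constant column carries no information for distinguishing observations, which is why it is usually safe to drop before modeling.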

Data Validation with the data.validator Package

library(data.validator)

# Create an object for storing validation results
report <- data_validation_report()

# Create a function for checking whether a value is in a vector
is.in = function(v){
  function(x){return(x %in% v)}
}

# Create a function to check if a value is between two given values.
between = function(a, b){
  function(x) { 
    return(a <= x & x <= b)
  }
}

# Create a function to check if a value is numeric.
is.Numeric = function(){
  function(x){return(!is.na(as.numeric(x)))}
}

# Data to be validated
D = data.frame(student = 1:5, 
               height = c(169, 180, "177 cm", "49cm", 192),
               gender = c("M", "M", "Female", "male", "F"),
               free.throw.rate = c(0.78, 0.90, 1.33, -0.23, 0.55)
              )

# Run the three checks on D and add the results to the report
validate(D, name = "Verifying student dataset") %>%
  validate_cols(is.Numeric(), height, description = "height is numeric") %>%
  validate_cols(between(0, 1), free.throw.rate, description = "free throw rate is between 0 and 1") %>%
  validate_cols(is.in(c("M", "F")), gender, description = "gender is either M or F") %>%
  add_results(report)

## Warning in improper.predicate(x): NAs introduced by coercion

## Warning in improper.predicate(x): NAs introduced by coercion

print(report)

## Validation summary: 
##  Number of successful validations: 0
##  Number of failed validations: 3
##  Number of validations with warnings: 0
## 
## Advanced view: 
## 
## 
## |table_name                |description                        |type  | total_violations|
## |:-------------------------|:----------------------------------|:-----|----------------:|
## |Verifying student dataset |free throw rate is between 0 and 1 |error |                2|
## |Verifying student dataset |gender is either M or F            |error |                2|
## |Verifying student dataset |height is numeric                  |error |                2|
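
To see where the violation counts in the report come from, the same three rules can be checked by hand in base R. This is only a sketch for cross-checking the report; the names bad.height, bad.rate, bad.gender, and violations are made up for this example.

```r
# Same data as above
D <- data.frame(student = 1:5,
                height = c(169, 180, "177 cm", "49cm", 192),
                gender = c("M", "M", "Female", "male", "F"),
                free.throw.rate = c(0.78, 0.90, 1.33, -0.23, 0.55))

# TRUE marks a value that breaks the rule
bad.height <- is.na(suppressWarnings(as.numeric(D$height)))
bad.rate   <- !(D$free.throw.rate >= 0 & D$free.throw.rate <= 1)
bad.gender <- !(D$gender %in% c("M", "F"))

# Number of violations per rule: 2 for each rule on this data
violations <- c(height = sum(bad.height),
                free.throw.rate = sum(bad.rate),
                gender = sum(bad.gender))
violations

# Rows that break at least one rule (students 3 and 4)
D[bad.height | bad.rate | bad.gender, ]
```

Mixing numbers with strings such as "177 cm" turns the whole height column into character, which is why the numeric check has to go through as.numeric().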

Missing Value Imputation with the mice Package

We will use the nhanes data frame from the mice package for demonstration.

library(mice)

# Display the data
nhanes

##    age  bmi hyp chl
## 1    1   NA  NA  NA
## 2    2 22.7   1 187
## 3    1   NA   1 187
## 4    3   NA  NA  NA
## 5    1 20.4   1 113
## 6    3   NA  NA 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2   NA  NA  NA
## 11   1   NA  NA  NA
## 12   2   NA  NA  NA
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1  NA
## 16   1   NA  NA  NA
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2  NA
## 21   1   NA  NA  NA
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1  NA
## 25   2 27.4   1 186

# Display missingness
library(Amelia)

## Loading required package: Rcpp

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

missmap(nhanes)

In the missingness map drawn by missmap(), the variable names of the data frame are on the x-axis and the row numbers are on the y-axis.
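
A numeric summary can complement the map. The sketch below uses only base R functions on the nhanes data; the counts in the comments were tallied from the data listing above.

```r
library(mice)  # provides the nhanes data frame

# Missing values per variable: age 0, bmi 9, hyp 8, chl 10
na.per.column <- colSums(is.na(nhanes))
na.per.column

# Overall share of missing cells: 27 of 100
mean(is.na(nhanes))

# Rows with no missing value at all: 13 of 25
n.complete <- sum(complete.cases(nhanes))
n.complete
```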

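Imputing the missing values in a data frame takes two steps: mice() generates several imputed datasets (five by default), and complete() extracts a completed data frame (the first imputed dataset by default).

# Impute with mice(), then extract a completed data frame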
nhanes %>% mice() %>% complete()

## 
##  iter imp variable
##   1   1  bmi  hyp  chl
##   1   2  bmi  hyp  chl
##   1   3  bmi  hyp  chl
##   1   4  bmi  hyp  chl
##   1   5  bmi  hyp  chl
##   2   1  bmi  hyp  chl
##   2   2  bmi  hyp  chl
##   2   3  bmi  hyp  chl
##   2   4  bmi  hyp  chl
##   2   5  bmi  hyp  chl
##   3   1  bmi  hyp  chl
##   3   2  bmi  hyp  chl
##   3   3  bmi  hyp  chl
##   3   4  bmi  hyp  chl
##   3   5  bmi  hyp  chl
##   4   1  bmi  hyp  chl
##   4   2  bmi  hyp  chl
##   4   3  bmi  hyp  chl
##   4   4  bmi  hyp  chl
##   4   5  bmi  hyp  chl
##   5   1  bmi  hyp  chl
##   5   2  bmi  hyp  chl
##   5   3  bmi  hyp  chl
##   5   4  bmi  hyp  chl
##   5   5  bmi  hyp  chl

##    age  bmi hyp chl
## 1    1 30.1   1 199
## 2    2 22.7   1 187
## 3    1 27.2   1 187
## 4    3 35.3   2 284
## 5    1 20.4   1 113
## 6    3 22.5   1 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2 25.5   2 238
## 11   1 21.7   1 238
## 12   2 25.5   1 206
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1 229
## 16   1 30.1   1 229
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2 184
## 21   1 22.0   1 238
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1 184
## 25   2 27.4   1 186
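
Because the imputed values are drawn at random, the result changes from run to run. The sketch below shows one way to make the imputation reproducible and to silence the iteration log; the seed value 123 is an arbitrary choice for illustration.

```r
library(mice)

# m = 5 imputed datasets; seed fixes the random draws; printFlag = FALSE hides the log
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Extract the first completed data frame and confirm nothing is missing
completed <- complete(imp, 1)
anyNA(completed)

# Mean bmi in each of the 5 completed datasets, to see how much the imputations vary
bmi.means <- sapply(1:5, function(i) mean(complete(imp, i)$bmi))
bmi.means
```

Looking at more than one completed dataset is a quick check on how sensitive downstream results are to the imputed values.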

Data Partition

In supervised learning, we need to partition data into

training set, used to train/fit models
validation set, used to choose the best model (apply each trained model to the validation set and choose the model with the highest performance in terms of some criterion), and
test set, on which the performance of the best model is reported.

usually in the ratio of 60%:20%:20%.

A Function for Partitioning Data into 3 Sets

partitionData = function(df, prop = c(0.60, 0.20, 0.20)){
    n = nrow(df)
    idx = sample(1:n, n)       # random permutation of the row numbers
    n1 = round(n*prop[1])      # size of the training set
    n2 = round(n*prop[2])      # size of the validation set
    
    train.idx = idx[1:n1]
    valid.idx = idx[(n1+1):(n1+n2)]
    test.idx = idx[-c(1:(n1+n2))]   # all remaining rows
    
    train = df[train.idx, ]
    valid = df[valid.idx, ]
    test = df[test.idx, ]
    # With two proportions, return only the training and validation sets
    if (length(prop) == 2){
      list(train = train, valid = valid)
    } else{
      list(train = train, valid = valid, test = test)
    }
    
}

partitionData(iris)

## $train
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 82           5.5         2.4          3.7         1.0 versicolor
## 31           4.8         3.1          1.6         0.2     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 58           4.9         2.4          3.3         1.0 versicolor
## 104          6.3         2.9          5.6         1.8  virginica
## 77           6.8         2.8          4.8         1.4 versicolor
## 7            4.6         3.4          1.4         0.3     setosa
## 115          5.8         2.8          5.1         2.4  virginica
## 3            4.7         3.2          1.3         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 101          6.3         3.3          6.0         2.5  virginica
## 150          5.9         3.0          5.1         1.8  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 6            5.4         3.9          1.7         0.4     setosa
## 148          6.5         3.0          5.2         2.0  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 99           5.1         2.5          3.0         1.1 versicolor
## 48           4.6         3.2          1.4         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 53           6.9         3.1          4.9         1.5 versicolor
## 4            4.6         3.1          1.5         0.2     setosa
## 73           6.3         2.5          4.9         1.5 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 32           5.4         3.4          1.5         0.4     setosa
## 90           5.5         2.5          4.0         1.3 versicolor
## 12           4.8         3.4          1.6         0.2     setosa
## 113          6.8         3.0          5.5         2.1  virginica
## 40           5.1         3.4          1.5         0.2     setosa
## 76           6.6         3.0          4.4         1.4 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 9            4.4         2.9          1.4         0.2     setosa
## 102          5.8         2.7          5.1         1.9  virginica
## 33           5.2         4.1          1.5         0.1     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 37           5.5         3.5          1.3         0.2     setosa
## 114          5.7         2.5          5.0         2.0  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 95           5.6         2.7          4.2         1.3 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 27           5.0         3.4          1.6         0.4     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 67           5.6         3.0          4.5         1.5 versicolor
## 118          7.7         3.8          6.7         2.2  virginica
## 46           4.8         3.0          1.4         0.3     setosa
## 94           5.0         2.3          3.3         1.0 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 138          6.4         3.1          5.5         1.8  virginica
## 81           5.5         2.4          3.8         1.1 versicolor
## 105          6.5         3.0          5.8         2.2  virginica
## 49           5.3         3.7          1.5         0.2     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 80           5.7         2.6          3.5         1.0 versicolor
## 139          6.0         3.0          4.8         1.8  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 84           6.0         2.7          5.1         1.6 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 39           4.4         3.0          1.3         0.2     setosa
## 116          6.4         3.2          5.3         2.3  virginica
## 100          5.7         2.8          4.1         1.3 versicolor
## 41           5.0         3.5          1.3         0.3     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 61           5.0         2.0          3.5         1.0 versicolor
## 149          6.2         3.4          5.4         2.3  virginica
## 71           5.9         3.2          4.8         1.8 versicolor
## 145          6.7         3.3          5.7         2.5  virginica
## 1            5.1         3.5          1.4         0.2     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 88           6.3         2.3          4.4         1.3 versicolor
## 11           5.4         3.7          1.5         0.2     setosa
## 72           6.1         2.8          4.0         1.3 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 141          6.7         3.1          5.6         2.4  virginica
## 86           6.0         3.4          4.5         1.6 versicolor
## 132          7.9         3.8          6.4         2.0  virginica
## 52           6.4         3.2          4.5         1.5 versicolor
## 147          6.3         2.5          5.0         1.9  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 64           6.1         2.9          4.7         1.4 versicolor
## 2            4.9         3.0          1.4         0.2     setosa
## 
## $valid
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 97           5.7         2.9          4.2         1.3 versicolor
## 126          7.2         3.2          6.0         1.8  virginica
## 79           6.0         2.9          4.5         1.5 versicolor
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 63           6.0         2.2          4.0         1.0 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 34           5.5         4.2          1.4         0.2     setosa
## 119          7.7         2.6          6.9         2.3  virginica
## 70           5.6         2.5          3.9         1.1 versicolor
## 129          6.4         2.8          5.6         2.1  virginica
## 89           5.6         3.0          4.1         1.3 versicolor
## 50           5.0         3.3          1.4         0.2     setosa
## 133          6.4         2.8          5.6         2.2  virginica
## 10           4.9         3.1          1.5         0.1     setosa
## 134          6.3         2.8          5.1         1.5  virginica
## 16           5.7         4.4          1.5         0.4     setosa
## 68           5.8         2.7          4.1         1.0 versicolor
## 22           5.1         3.7          1.5         0.4     setosa
## 130          7.2         3.0          5.8         1.6  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 13           4.8         3.0          1.4         0.1     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 55           6.5         2.8          4.6         1.5 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 144          6.8         3.2          5.9         2.3  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 69           6.2         2.2          4.5         1.5 versicolor
## 
## $test
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 127          6.2         2.8          4.8         1.8  virginica
## 51           7.0         3.2          4.7         1.4 versicolor
## 137          6.3         3.4          5.6         2.4  virginica
## 74           6.1         2.8          4.7         1.2 versicolor
## 136          7.7         3.0          6.1         2.3  virginica
## 92           6.1         3.0          4.6         1.4 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 45           5.1         3.8          1.9         0.4     setosa
## 60           5.2         2.7          3.9         1.4 versicolor
## 14           4.3         3.0          1.1         0.1     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 106          7.6         3.0          6.6         2.1  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 43           4.4         3.2          1.3         0.2     setosa
## 128          6.1         3.0          4.9         1.8  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 8            5.0         3.4          1.5         0.2     setosa
## 78           6.7         3.0          5.0         1.7 versicolor
## 140          6.9         3.1          5.4         2.1  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 87           6.7         3.1          4.7         1.5 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 5            5.0         3.6          1.4         0.2     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 62           5.9         3.0          4.2         1.5 versicolor
## 15           5.8         4.0          1.2         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 96           5.7         3.0          4.2         1.2 versicolor
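
Printing every row makes it hard to confirm the split sizes. As a compact alternative sketch (not the function above), the rows can be labeled at random and then split by label; the seed and the names sizes, labels, and parts are arbitrary choices for this example.

```r
set.seed(2024)  # arbitrary seed, only to make the split reproducible

n <- nrow(iris)
sizes <- c(train = round(0.6 * n), valid = round(0.2 * n))
sizes["test"] <- n - sum(sizes)   # the test set takes the remaining rows

# One label per row, shuffled, then split the data frame by label
labels <- sample(rep(names(sizes), times = sizes))
parts <- split(iris, labels)

# Check the sizes: 90 training, 30 validation, 30 test rows
sapply(parts, nrow)
```

Setting a seed before the split means the same rows land in each set every time the code is rerun.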

If you fit only one model, there is no model selection step, so two sets are enough: a training set and one held-out set. The held-out set is then often called the validation set rather than the test set; only the name changes.

Using the caret Package to Partition Data into k Folds

We can use the function createFolds() from the caret (Classification And REgression Training) package to partition a vector randomly into k folds of about equal size.

The following code partitions the vector 1 to 20 randomly into 3 folds.

library(caret)

createFolds(1:20, 3)

## $Fold1
## [1]  2  7 11 14 16 20
## 
## $Fold2
## [1]  1  4  6  8 12 13 17
## 
## $Fold3
## [1]  3  5  9 10 15 18 19

The following code partitions the mtcars data frame randomly into 3 datasets.

n = nrow(mtcars)
# Partition the indices (i.e., row numbers) first. The 3 sets of indices are stored in a list (here, I).
I = createFolds(1:n, 3)
# Print I
I

## $Fold1
##  [1]  2  6  7 11 14 15 20 22 24 26 29 32
## 
## $Fold2
##  [1]  1  4  5  9 10 17 23 28 30 31
## 
## $Fold3
##  [1]  3  8 12 13 16 18 19 21 25 27

# get the 3 datasets
set1 = mtcars[I$Fold1, ]

set2 = mtcars[I$Fold2, ]

set3 = mtcars[I$Fold3, ]
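
A common use of such folds is k-fold cross-validation, where each fold serves once as the hold-out set. The sketch below is an illustration under assumptions of our own: the linear model mpg ~ wt + hp, the seed, and the RMSE criterion are choices made for this example, not something prescribed by these notes.

```r
library(caret)

set.seed(42)  # arbitrary seed for reproducible folds
folds <- createFolds(mtcars$mpg, k = 3)

# For each fold: fit on the other rows, predict the held-out rows, compute RMSE
rmse <- sapply(folds, function(test.idx) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[-test.idx, ])
  pred <- predict(fit, newdata = mtcars[test.idx, ])
  sqrt(mean((mtcars$mpg[test.idx] - pred)^2))
})

rmse        # one RMSE per fold
mean(rmse)  # cross-validated estimate of the prediction error
```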

Question: How can you use the createFolds() function to partition the mtcars data frame into training (70%), validation (15%), and test (15%)?

A note: The textbook often partitions a data frame into just training and validation sets. We will follow this practice. In this situation, you can use the createDataPartition() function instead of createFolds() to obtain the training indices; the remaining rows form the validation set. The syntax is

createDataPartition(1:nrow(YourDataFrame), p = 0.7)

where p is the proportion of data going to the training set.
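
A minimal usage sketch follows, assuming mpg is the outcome of interest and using list = FALSE so that the indices come back as a one-column matrix rather than a list. The seed and the names train.idx, train, and valid are arbitrary.

```r
library(caret)

set.seed(1)  # arbitrary seed for a reproducible split
train.idx <- createDataPartition(mtcars$mpg, p = 0.7, list = FALSE)

train <- mtcars[train.idx, ]    # roughly 70% of the rows
valid <- mtcars[-train.idx, ]   # the remaining rows

c(train = nrow(train), valid = nrow(valid))
```

The training share is only approximately 70% because createDataPartition() samples within groups of the outcome to keep its distribution similar in both sets.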

Re-scaling Data with the caret Package

In machine learning, re-scaling numerical variables (also called features) that have varying scales is a good idea when fitting models such as kNN and neural networks.

Re-scaling can speed up the convergence of the fitting algorithm and can improve the validation performance of the model being used.

For a numeric variable, say x, commonly used re-scaling methods include

standardization: (x - mean)/(standard deviation), also called the z-score
normalization: (x - min)/(max - min), which maps the values to the range [0, 1]

The former (z-score method) is suggested when your model requires features to have normal distributions. When there is no such requirement, use the latter (range method).
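
The two formulas can be worked through by hand in base R. The vector x below is made-up data for illustration.

```r
x <- c(2, 4, 6, 8)  # made-up values; mean 5, min 2, max 8

# Standardization (z-score): subtract the mean, divide by the standard deviation
z <- (x - mean(x)) / sd(x)
round(z, 3)   # -1.162 -0.387  0.387  1.162

# Normalization (range method): subtract the minimum, divide by the range
r <- (x - min(x)) / (max(x) - min(x))
r             # 0, 1/3, 2/3, 1

# Base R's scale() gives the same z-scores
all.equal(as.numeric(scale(x)), z)
```

After standardization the values have mean 0 and standard deviation 1; after normalization they lie between 0 and 1.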

D = iris
process <- preProcess(D, method="range")  # use c("center", "scale") for standardization

normalized.D <- predict(process, D)

normalized.D

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1     0.22222222  0.62500000   0.06779661  0.04166667     setosa
## 2     0.16666667  0.41666667   0.06779661  0.04166667     setosa
## 3     0.11111111  0.50000000   0.05084746  0.04166667     setosa
## 4     0.08333333  0.45833333   0.08474576  0.04166667     setosa
## 5     0.19444444  0.66666667   0.06779661  0.04166667     setosa
## 6     0.30555556  0.79166667   0.11864407  0.12500000     setosa
## 7     0.08333333  0.58333333   0.06779661  0.08333333     setosa
## 8     0.19444444  0.58333333   0.08474576  0.04166667     setosa
## 9     0.02777778  0.37500000   0.06779661  0.04166667     setosa
## 10    0.16666667  0.45833333   0.08474576  0.00000000     setosa
## 11    0.30555556  0.70833333   0.08474576  0.04166667     setosa
## 12    0.13888889  0.58333333   0.10169492  0.04166667     setosa
## 13    0.13888889  0.41666667   0.06779661  0.00000000     setosa
## 14    0.00000000  0.41666667   0.01694915  0.00000000     setosa
## 15    0.41666667  0.83333333   0.03389831  0.04166667     setosa
## 16    0.38888889  1.00000000   0.08474576  0.12500000     setosa
## 17    0.30555556  0.79166667   0.05084746  0.12500000     setosa
## 18    0.22222222  0.62500000   0.06779661  0.08333333     setosa
## 19    0.38888889  0.75000000   0.11864407  0.08333333     setosa
## 20    0.22222222  0.75000000   0.08474576  0.08333333     setosa
## 21    0.30555556  0.58333333   0.11864407  0.04166667     setosa
## 22    0.22222222  0.70833333   0.08474576  0.12500000     setosa
## 23    0.08333333  0.66666667   0.00000000  0.04166667     setosa
## 24    0.22222222  0.54166667   0.11864407  0.16666667     setosa
## 25    0.13888889  0.58333333   0.15254237  0.04166667     setosa
## 26    0.19444444  0.41666667   0.10169492  0.04166667     setosa
## 27    0.19444444  0.58333333   0.10169492  0.12500000     setosa
## 28    0.25000000  0.62500000   0.08474576  0.04166667     setosa
## 29    0.25000000  0.58333333   0.06779661  0.04166667     setosa
## 30    0.11111111  0.50000000   0.10169492  0.04166667     setosa
## 31    0.13888889  0.45833333   0.10169492  0.04166667     setosa
## 32    0.30555556  0.58333333   0.08474576  0.12500000     setosa
## 33    0.25000000  0.87500000   0.08474576  0.00000000     setosa
## 34    0.33333333  0.91666667   0.06779661  0.04166667     setosa
## 35    0.16666667  0.45833333   0.08474576  0.04166667     setosa
## 36    0.19444444  0.50000000   0.03389831  0.04166667     setosa
## 37    0.33333333  0.62500000   0.05084746  0.04166667     setosa
## 38    0.16666667  0.66666667   0.06779661  0.00000000     setosa
## 39    0.02777778  0.41666667   0.05084746  0.04166667     setosa
## 40    0.22222222  0.58333333   0.08474576  0.04166667     setosa
## 41    0.19444444  0.62500000   0.05084746  0.08333333     setosa
## 42    0.05555556  0.12500000   0.05084746  0.08333333     setosa
## 43    0.02777778  0.50000000   0.05084746  0.04166667     setosa
## 44    0.19444444  0.62500000   0.10169492  0.20833333     setosa
## 45    0.22222222  0.75000000   0.15254237  0.12500000     setosa
## 46    0.13888889  0.41666667   0.06779661  0.08333333     setosa
## 47    0.22222222  0.75000000   0.10169492  0.04166667     setosa
## 48    0.08333333  0.50000000   0.06779661  0.04166667     setosa
## 49    0.27777778  0.70833333   0.08474576  0.04166667     setosa
## 50    0.19444444  0.54166667   0.06779661  0.04166667     setosa
## 51    0.75000000  0.50000000   0.62711864  0.54166667 versicolor
## 52    0.58333333  0.50000000   0.59322034  0.58333333 versicolor
## 53    0.72222222  0.45833333   0.66101695  0.58333333 versicolor
## 54    0.33333333  0.12500000   0.50847458  0.50000000 versicolor
## 55    0.61111111  0.33333333   0.61016949  0.58333333 versicolor
## 56    0.38888889  0.33333333   0.59322034  0.50000000 versicolor
## 57    0.55555556  0.54166667   0.62711864  0.62500000 versicolor
## 58    0.16666667  0.16666667   0.38983051  0.37500000 versicolor
## 59    0.63888889  0.37500000   0.61016949  0.50000000 versicolor
## 60    0.25000000  0.29166667   0.49152542  0.54166667 versicolor
## 61    0.19444444  0.00000000   0.42372881  0.37500000 versicolor
## 62    0.44444444  0.41666667   0.54237288  0.58333333 versicolor
## 63    0.47222222  0.08333333   0.50847458  0.37500000 versicolor
## 64    0.50000000  0.37500000   0.62711864  0.54166667 versicolor
## 65    0.36111111  0.37500000   0.44067797  0.50000000 versicolor
## 66    0.66666667  0.45833333   0.57627119  0.54166667 versicolor
## 67    0.36111111  0.41666667   0.59322034  0.58333333 versicolor
## 68    0.41666667  0.29166667   0.52542373  0.37500000 versicolor
## 69    0.52777778  0.08333333   0.59322034  0.58333333 versicolor
## 70    0.36111111  0.20833333   0.49152542  0.41666667 versicolor
## 71    0.44444444  0.50000000   0.64406780  0.70833333 versicolor
## 72    0.50000000  0.33333333   0.50847458  0.50000000 versicolor
## 73    0.55555556  0.20833333   0.66101695  0.58333333 versicolor
## 74    0.50000000  0.33333333   0.62711864  0.45833333 versicolor
## 75    0.58333333  0.37500000   0.55932203  0.50000000 versicolor
## 76    0.63888889  0.41666667   0.57627119  0.54166667 versicolor
## 77    0.69444444  0.33333333   0.64406780  0.54166667 versicolor
## 78    0.66666667  0.41666667   0.67796610  0.66666667 versicolor
## 79    0.47222222  0.37500000   0.59322034  0.58333333 versicolor
## 80    0.38888889  0.25000000   0.42372881  0.37500000 versicolor
## 81    0.33333333  0.16666667   0.47457627  0.41666667 versicolor
## 82    0.33333333  0.16666667   0.45762712  0.37500000 versicolor
## 83    0.41666667  0.29166667   0.49152542  0.45833333 versicolor
## 84    0.47222222  0.29166667   0.69491525  0.62500000 versicolor
## 85    0.30555556  0.41666667   0.59322034  0.58333333 versicolor
## 86    0.47222222  0.58333333   0.59322034  0.62500000 versicolor
## 87    0.66666667  0.45833333   0.62711864  0.58333333 versicolor
## 88    0.55555556  0.12500000   0.57627119  0.50000000 versicolor
## 89    0.36111111  0.41666667   0.52542373  0.50000000 versicolor
## 90    0.33333333  0.20833333   0.50847458  0.50000000 versicolor
## 91    0.33333333  0.25000000   0.57627119  0.45833333 versicolor
## 92    0.50000000  0.41666667   0.61016949  0.54166667 versicolor
## 93    0.41666667  0.25000000   0.50847458  0.45833333 versicolor
## 94    0.19444444  0.12500000   0.38983051  0.37500000 versicolor
## 95    0.36111111  0.29166667   0.54237288  0.50000000 versicolor
## 96    0.38888889  0.41666667   0.54237288  0.45833333 versicolor
## 97    0.38888889  0.37500000   0.54237288  0.50000000 versicolor
## 98    0.52777778  0.37500000   0.55932203  0.50000000 versicolor
## 99    0.22222222  0.20833333   0.33898305  0.41666667 versicolor
## 100   0.38888889  0.33333333   0.52542373  0.50000000 versicolor
## 101   0.55555556  0.54166667   0.84745763  1.00000000  virginica
## 102   0.41666667  0.29166667   0.69491525  0.75000000  virginica
## 103   0.77777778  0.41666667   0.83050847  0.83333333  virginica
## 104   0.55555556  0.37500000   0.77966102  0.70833333  virginica
## 105   0.61111111  0.41666667   0.81355932  0.87500000  virginica
## 106   0.91666667  0.41666667   0.94915254  0.83333333  virginica
## 107   0.16666667  0.20833333   0.59322034  0.66666667  virginica
## 108   0.83333333  0.37500000   0.89830508  0.70833333  virginica
## 109   0.66666667  0.20833333   0.81355932  0.70833333  virginica
## 110   0.80555556  0.66666667   0.86440678  1.00000000  virginica
## 111   0.61111111  0.50000000   0.69491525  0.79166667  virginica
## 112   0.58333333  0.29166667   0.72881356  0.75000000  virginica
## 113   0.69444444  0.41666667   0.76271186  0.83333333  virginica
## 114   0.38888889  0.20833333   0.67796610  0.79166667  virginica
## 115   0.41666667  0.33333333   0.69491525  0.95833333  virginica
## 116   0.58333333  0.50000000   0.72881356  0.91666667  virginica
## 117   0.61111111  0.41666667   0.76271186  0.70833333  virginica
## 118   0.94444444  0.75000000   0.96610169  0.87500000  virginica
## 119   0.94444444  0.25000000   1.00000000  0.91666667  virginica
## 120   0.47222222  0.08333333   0.67796610  0.58333333  virginica
## 121   0.72222222  0.50000000   0.79661017  0.91666667  virginica
## 122   0.36111111  0.33333333   0.66101695  0.79166667  virginica
## 123   0.94444444  0.33333333   0.96610169  0.79166667  virginica
## 124   0.55555556  0.29166667   0.66101695  0.70833333  virginica
## 125   0.66666667  0.54166667   0.79661017  0.83333333  virginica
## 126   0.80555556  0.50000000   0.84745763  0.70833333  virginica
## 127   0.52777778  0.33333333   0.64406780  0.70833333  virginica
## 128   0.50000000  0.41666667   0.66101695  0.70833333  virginica
## 129   0.58333333  0.33333333   0.77966102  0.83333333  virginica
## 130   0.80555556  0.41666667   0.81355932  0.62500000  virginica
## 131   0.86111111  0.33333333   0.86440678  0.75000000  virginica
## 132   1.00000000  0.75000000   0.91525424  0.79166667  virginica
## 133   0.58333333  0.33333333   0.77966102  0.87500000  virginica
## 134   0.55555556  0.33333333   0.69491525  0.58333333  virginica
## 135   0.50000000  0.25000000   0.77966102  0.54166667  virginica
## 136   0.94444444  0.41666667   0.86440678  0.91666667  virginica
## 137   0.55555556  0.58333333   0.77966102  0.95833333  virginica
## 138   0.58333333  0.45833333   0.76271186  0.70833333  virginica
## 139   0.47222222  0.41666667   0.64406780  0.70833333  virginica
## 140   0.72222222  0.45833333   0.74576271  0.83333333  virginica
## 141   0.66666667  0.45833333   0.77966102  0.95833333  virginica
## 142   0.72222222  0.45833333   0.69491525  0.91666667  virginica
## 143   0.41666667  0.29166667   0.69491525  0.75000000  virginica
## 144   0.69444444  0.50000000   0.83050847  0.91666667  virginica
## 145   0.66666667  0.54166667   0.79661017  1.00000000  virginica
## 146   0.66666667  0.41666667   0.71186441  0.91666667  virginica
## 147   0.55555556  0.20833333   0.67796610  0.75000000  virginica
## 148   0.61111111  0.41666667   0.71186441  0.79166667  virginica
## 149   0.52777778  0.58333333   0.74576271  0.91666667  virginica
## 150   0.44444444  0.41666667   0.69491525  0.70833333  virginica
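
When the data have been partitioned, a common practice is to estimate the re-scaling parameters on the training set only and then apply them to the other sets. The sketch below illustrates this with a 70/30 split of iris; the split, the seed, and the object names are assumptions made for this example.

```r
library(caret)

set.seed(1)  # arbitrary seed for a reproducible split
train.idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train.idx, ]
valid <- iris[-train.idx, ]

# Learn the min and max of each numeric column from the training set only
process <- preProcess(train, method = "range")

# Apply the same transformation to both sets
train.scaled <- predict(process, train)
valid.scaled <- predict(process, valid)

range(train.scaled$Sepal.Length)  # exactly 0 to 1
range(valid.scaled$Sepal.Length)  # can fall slightly outside [0, 1]
```

Validation values can land outside [0, 1] because they are scaled with the training minimum and maximum, which is expected and keeps information from the validation set out of the preprocessing step.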

Not every model requires data re-scaling. For example, regression and classification tree models do not require re-scaling.

Data Reduction and Dimension Reduction

Data reduction: Group observations into clusters, each containing similar individuals. It operates on rows.

Dimension reduction: Reduce the number of features to improve predictive power, manageability, and interpretability. It operates on columns.
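
As a hedged illustration of the two ideas, the sketch below uses k-means clustering for data reduction and principal component analysis for dimension reduction, both from base R's stats package. These are only two of many possible techniques, and the choices of 3 clusters and 2 components are arbitrary.

```r
X <- iris[, 1:4]  # the four numeric features

# Data reduction (rows): summarize 150 observations by 3 cluster centers
set.seed(1)  # k-means starts from random centers
km <- kmeans(scale(X), centers = 3, nstart = 10)
km$centers          # one row per cluster
table(km$cluster)   # number of observations in each cluster

# Dimension reduction (columns): replace 4 features by 2 principal components
pc <- prcomp(X, scale. = TRUE)
scores <- pc$x[, 1:2]
dim(scores)         # 150 rows, 2 columns
summary(pc)$importance["Cumulative Proportion", 2]  # variance retained by 2 components
```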