Data Mining with R Final Project

Soudeh Khoubrouy

Instructor: Charles Pierre, PhD, M.Sc. in Analytics

Directions: There are 10 very popular standard datasets for practicing data engineering and mining, and applied machine learning. These datasets are well researched, well studied and well used, and they have been studied for quite a while. Please go to the internet and find at least two of these datasets. The datasets are the following:

1. Swedish Auto Insurance Dataset
2. Wine Quality Dataset
3. Pima Indians Diabetes Dataset
4. Sonar Dataset
5. Banknote Dataset
6. Iris Flowers Dataset
7. Abalone Dataset
8. Ionosphere Dataset
9. Wheat Seed Dataset
10. Boston House Price Dataset

Your job is to take any two of these datasets, study the complete dataset and know all of its characteristics (e.g., what is the dimension of the dataset, how many [if any] factors are there, determine whether there are correlations between features, …)
Then, take each dataset and take it through a complete pre-processing algorithm/task. [Hint: you can use Assignment 2; you should also read pages 53 – 65 of L. Torgo’s text. Also read pages 78-95 of Torgo]
Keep in mind that you are practicing on two different datasets. So, run each dataset through the preprocessing phase.
Once your two datasets have been pre-processed completely, do a complete data visualization on each (Read Pp. 96-110, L. Torgo)
Finally, prepare your two projects for presentation. Note: This is an individual assignment. Not a group assignment.

rm(list = ls())

Abalone Dataset

This dataset is about predicting the age of abalone from physical measurements. The age of abalone is determined by cutting its shell through the cone, staining it, and counting the number of rings through a microscope. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

I downloaded this dataset from the UCI Machine Learning Repository. It has no missing values, and the continuous values have already been scaled for use with an ANN (by dividing by 200). The data has 8 features of abalone:

sex
length
diameter
height
whole weight
shucked weight
viscera weight
shell weight

and the number of rings as the dependent (target) variable.

More about the Abalone dataset

abalone_data <- read.table("abalone/abalone.data", sep = ',')
abalone_data[,1] <- as.factor(abalone_data[,1])
cat('Below is the size and structure of the data and type of each variable:\n')

## Below is the size and structure of the data and type of each variable:

str(abalone_data)

## 'data.frame':    4177 obs. of  9 variables:
##  $ V1: Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
##  $ V2: num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
##  $ V3: num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
##  $ V4: num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
##  $ V5: num  0.514 0.226 0.677 0.516 0.205 ...
##  $ V6: num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ V7: num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ V8: num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
##  $ V9: int  15 7 9 10 7 8 20 16 9 19 ...
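
The file has no header row, so read.table names the columns V1 to V9. A minimal sketch of attaching descriptive names, in the order of the feature list above. The short names (such as whole_wt) are my own choice for illustration, and the toy rows are the first five records as they appear, rounded, in the str() output:

```r
# First five records, with values as displayed (rounded) in the str() output.
toy <- data.frame(
  V1 = factor(c("M", "M", "F", "M", "I")),
  V2 = c(0.455, 0.350, 0.530, 0.440, 0.330),
  V3 = c(0.365, 0.265, 0.420, 0.365, 0.255),
  V4 = c(0.095, 0.090, 0.135, 0.125, 0.080),
  V5 = c(0.514, 0.226, 0.677, 0.516, 0.205),
  V6 = c(0.2245, 0.0995, 0.2565, 0.2155, 0.0895),
  V7 = c(0.1010, 0.0485, 0.1415, 0.1140, 0.0395),
  V8 = c(0.150, 0.070, 0.210, 0.155, 0.055),
  V9 = c(15L, 7L, 9L, 10L, 7L)
)

# Hypothetical short names, in the same order as the feature list.
abalone_names <- c("sex", "length", "diameter", "height", "whole_wt",
                   "shucked_wt", "viscera_wt", "shell_wt", "rings")
names(toy) <- abalone_names
str(toy)
```

With the real file, the same vector could be passed to read.table through its col.names argument. The rest of this report keeps the default names V1 to V9.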

cat('The sex variable is a factor with three levels (F, I, M).\n')

## The sex variable is a factor with three levels (F, I, M).

cat("Number of missing values: ", sum(is.na(abalone_data)),'\n')

## Number of missing values:  0

cat('Below is the summary of key statistics of the data:\n')

## Below is the summary of key statistics of the data:

summary(abalone_data)

##  V1             V2              V3               V4               V5        
##  F:1307   Min.   :0.075   Min.   :0.0550   Min.   :0.0000   Min.   :0.0020  
##  I:1342   1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.1150   1st Qu.:0.4415  
##  M:1528   Median :0.545   Median :0.4250   Median :0.1400   Median :0.7995  
##           Mean   :0.524   Mean   :0.4079   Mean   :0.1395   Mean   :0.8287  
##           3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.1650   3rd Qu.:1.1530  
##           Max.   :0.815   Max.   :0.6500   Max.   :1.1300   Max.   :2.8255  
##        V6               V7               V8               V9        
##  Min.   :0.0010   Min.   :0.0005   Min.   :0.0015   Min.   : 1.000  
##  1st Qu.:0.1860   1st Qu.:0.0935   1st Qu.:0.1300   1st Qu.: 8.000  
##  Median :0.3360   Median :0.1710   Median :0.2340   Median : 9.000  
##  Mean   :0.3594   Mean   :0.1806   Mean   :0.2388   Mean   : 9.934  
##  3rd Qu.:0.5020   3rd Qu.:0.2530   3rd Qu.:0.3290   3rd Qu.:11.000  
##  Max.   :1.4880   Max.   :0.7600   Max.   :1.0050   Max.   :29.000

Check correlations

cor(abalone_data[,2:9])

##           V2        V3        V4        V5        V6        V7        V8
## V2 1.0000000 0.9868116 0.8275536 0.9252612 0.8979137 0.9030177 0.8977056
## V3 0.9868116 1.0000000 0.8336837 0.9254521 0.8931625 0.8997244 0.9053298
## V4 0.8275536 0.8336837 1.0000000 0.8192208 0.7749723 0.7983193 0.8173380
## V5 0.9252612 0.9254521 0.8192208 1.0000000 0.9694055 0.9663751 0.9553554
## V6 0.8979137 0.8931625 0.7749723 0.9694055 1.0000000 0.9319613 0.8826171
## V7 0.9030177 0.8997244 0.7983193 0.9663751 0.9319613 1.0000000 0.9076563
## V8 0.8977056 0.9053298 0.8173380 0.9553554 0.8826171 0.9076563 1.0000000
## V9 0.5567196 0.5746599 0.5574673 0.5403897 0.4208837 0.5038192 0.6275740
##           V9
## V2 0.5567196
## V3 0.5746599
## V4 0.5574673
## V5 0.5403897
## V6 0.4208837
## V7 0.5038192
## V8 0.6275740
## V9 1.0000000

Based on the correlation matrix, V2 (length) and V3 (diameter) have the highest correlation, about 0.987.
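
Picking the largest off-diagonal entry out of a printed matrix by eye is error-prone, so it can help to let R find it. A small sketch in base R, run on made-up data; the function name top_cor_pair and the toy data frame are illustrative assumptions, not part of the original analysis:

```r
# Return the pair of columns with the largest absolute correlation.
top_cor_pair <- function(df) {
  cm <- cor(df)
  cm[lower.tri(cm, diag = TRUE)] <- NA   # keep each pair once, drop the diagonal
  best <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)
  i <- best[1, "row"]
  j <- best[1, "col"]
  data.frame(var1 = rownames(cm)[i], var2 = colnames(cm)[j], r = cm[i, j],
             row.names = NULL)
}

# Made-up data: b is an exact linear function of a, c is an unrelated sequence
toy <- data.frame(a = 1:10,
                  b = 2 * (1:10) + 1,
                  c = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))
top_cor_pair(toy)
```

On the abalone data the call would be top_cor_pair(abalone_data[,2:9]).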

Scale Data

library(caret)
# calculate the pre-process parameters from the dataset
abalone_scale_Params <- preProcess(abalone_data[,2:9], method= c('scale'))
# summarize transform parameters
print(abalone_scale_Params)

## Created from 4177 samples and 8 variables
## 
## Pre-processing:
##   - ignored (0)
##   - scaled (8)

# transform the dataset using the parameters
abalone_scaled <- predict(abalone_scale_Params, abalone_data[,2:9])
summary(abalone_scaled)

##        V2               V3               V4               V5          
##  Min.   :0.6245   Min.   :0.5542   Min.   : 0.000   Min.   :0.004078  
##  1st Qu.:3.7471   1st Qu.:3.5268   1st Qu.: 2.749   1st Qu.:0.900306  
##  Median :4.5382   Median :4.2826   Median : 3.347   Median :1.630338  
##  Mean   :4.3632   Mean   :4.1101   Mean   : 3.336   Mean   :1.689969  
##  3rd Qu.:5.1210   3rd Qu.:4.8368   3rd Qu.: 3.945   3rd Qu.:2.351195  
##  Max.   :6.7864   Max.   :6.5498   Max.   :27.016   Max.   :5.761752  
##        V6                 V7                 V8                V9        
##  Min.   :0.004505   Min.   :0.004561   Min.   :0.01078   Min.   :0.3102  
##  1st Qu.:0.837978   1st Qu.:0.852991   1st Qu.:0.93389   1st Qu.:2.4813  
##  Median :1.513766   Median :1.560016   Median :1.68100   Median :2.7914  
##  Mean   :1.619043   Mean   :1.647538   Mean   :1.71571   Mean   :3.0810  
##  3rd Qu.:2.261639   3rd Qu.:2.308094   3rd Qu.:2.36346   3rd Qu.:3.4117  
##  Max.   :6.703822   Max.   :6.933405   Max.   :7.21969   Max.   :8.9946

Center Data

# load packages
library(caret)
# calculate the pre-process parameters from the dataset
abalone_center_Params <- preProcess(abalone_data[,2:9], method= c('center'))
# summarize transform parameters
print(abalone_center_Params)

## Created from 4177 samples and 8 variables
## 
## Pre-processing:
##   - centered (8)
##   - ignored (0)

# transform the dataset using the parameters
abalone_centered <- predict(abalone_center_Params, abalone_data[,2:9])
summary(abalone_centered)

##        V2                 V3                 V4                   V5          
##  Min.   :-0.44899   Min.   :-0.35288   Min.   :-0.1395164   Min.   :-0.82674  
##  1st Qu.:-0.07399   1st Qu.:-0.05788   1st Qu.:-0.0245164   1st Qu.:-0.38724  
##  Median : 0.02101   Median : 0.01712   Median : 0.0004836   Median :-0.02924  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000000   Mean   : 0.00000  
##  3rd Qu.: 0.09101   3rd Qu.: 0.07212   3rd Qu.: 0.0254836   3rd Qu.: 0.32426  
##  Max.   : 0.29101   Max.   : 0.24212   Max.   : 0.9904836   Max.   : 1.99676  
##        V6                 V7                  V8                  V9         
##  Min.   :-0.35837   Min.   :-0.180094   Min.   :-0.237331   Min.   :-8.9337  
##  1st Qu.:-0.17337   1st Qu.:-0.087094   1st Qu.:-0.108831   1st Qu.:-1.9337  
##  Median :-0.02337   Median :-0.009594   Median :-0.004831   Median :-0.9337  
##  Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.0000  
##  3rd Qu.: 0.14263   3rd Qu.: 0.072406   3rd Qu.: 0.090169   3rd Qu.: 1.0663  
##  Max.   : 1.12863   Max.   : 0.579406   Max.   : 0.766169   Max.   :19.0663

Standardize Data

# load packages
library(caret)
# calculate the pre-process parameters from the dataset
abalone_standardize_Params <- preProcess(abalone_data[,2:9], method= c('center','scale'))
# summarize transform parameters
print(abalone_standardize_Params)

## Created from 4177 samples and 8 variables
## 
## Pre-processing:
##   - centered (8)
##   - ignored (0)
##   - scaled (8)

# transform the dataset using the parameters
abalone_standardized <- predict(abalone_standardize_Params, abalone_data[,2:9])
summary(abalone_standardized)

##        V2                V3                V4                 V5          
##  Min.   :-3.7387   Min.   :-3.5558   Min.   :-3.33555   Min.   :-1.68589  
##  1st Qu.:-0.6161   1st Qu.:-0.5832   1st Qu.:-0.58614   1st Qu.:-0.78966  
##  Median : 0.1749   Median : 0.1725   Median : 0.01156   Median :-0.05963  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.7578   3rd Qu.: 0.7267   3rd Qu.: 0.60926   3rd Qu.: 0.66123  
##  Max.   : 2.4232   Max.   : 2.4397   Max.   :23.68045   Max.   : 4.07178  
##        V6                V7                 V8                V9         
##  Min.   :-1.6145   Min.   :-1.64298   Min.   :-1.7049   Min.   :-2.7708  
##  1st Qu.:-0.7811   1st Qu.:-0.79455   1st Qu.:-0.7818   1st Qu.:-0.5997  
##  Median :-0.1053   Median :-0.08752   Median :-0.0347   Median :-0.2896  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6426   3rd Qu.: 0.66056   3rd Qu.: 0.6478   3rd Qu.: 0.3307  
##  Max.   : 5.0848   Max.   : 5.28587   Max.   : 5.5040   Max.   : 5.9136
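
The three transforms above differ only in which of two steps they apply: subtracting the column mean ('center') and dividing by the column standard deviation ('scale'). A base R sketch on a made-up vector, meant to illustrate the arithmetic rather than to reproduce caret's internals:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values, not from the dataset

x_scaled       <- x / sd(x)               # 'scale' only
x_centered     <- x - mean(x)             # 'center' only
x_standardized <- (x - mean(x)) / sd(x)   # 'center' followed by 'scale'

# base R's scale() performs the same standardization on a column
all.equal(x_standardized, as.numeric(scale(x)))
c(mean = mean(x_standardized), sd = sd(x_standardized))
```

This also explains the summaries above: after scaling alone the minimum of V4 stays at 0, while after standardizing every column has mean 0.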

Normalize Data

library(caret)
# calculate the pre-process parameters from the dataset
abalone_normalize_Params <- preProcess(abalone_data[,2:9], method= c('range'))
# summarize transform parameters
print(abalone_normalize_Params)

## Created from 4177 samples and 8 variables
## 
## Pre-processing:
##   - ignored (0)
##   - re-scaling to [0, 1] (8)

# transform the dataset using the parameters
abalone_normalized <- predict(abalone_normalize_Params, abalone_data[,2:9])
summary(abalone_normalized)

##        V2               V3               V4               V5        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.5068   1st Qu.:0.4958   1st Qu.:0.1018   1st Qu.:0.1557  
##  Median :0.6351   Median :0.6218   Median :0.1239   Median :0.2825  
##  Mean   :0.6067   Mean   :0.5931   Mean   :0.1235   Mean   :0.2928  
##  3rd Qu.:0.7297   3rd Qu.:0.7143   3rd Qu.:0.1460   3rd Qu.:0.4077  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##        V6               V7               V8               V9        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1244   1st Qu.:0.1224   1st Qu.:0.1281   1st Qu.:0.2500  
##  Median :0.2253   Median :0.2245   Median :0.2317   Median :0.2857  
##  Mean   :0.2410   Mean   :0.2371   Mean   :0.2365   Mean   :0.3191  
##  3rd Qu.:0.3369   3rd Qu.:0.3325   3rd Qu.:0.3264   3rd Qu.:0.3571  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
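
The 'range' method maps each column linearly onto [0, 1]. A minimal base R sketch; the helper name min_max is my own. Because the mapping is monotone, applying it to the minimum, quartiles and maximum of V9 from the earlier summary() should give the same figures as the V9 column above:

```r
# Rescale a numeric vector to [0, 1]; assumes x is not constant.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Min, 1st quartile, median, 3rd quartile and max of V9 (rings) from summary()
rings <- c(1, 8, 9, 11, 29)
round(min_max(rings), 4)
```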

Box-Cox Transform

The Box-Cox method can only be applied to strictly positive values, so we first check whether the dataset contains any non-positive values.

# load packages
library(mlbench)
library(caret)
# Count the non-positive values in the numeric columns
cat('The data has', sum(abalone_data[,2:9] <= 0), 'non-positive values.\n')

## The data has 2 non-positive values.

cat('Because the data contains non-positive values, the Box-Cox method cannot be applied to the data as is.\n')

## Because the data contains non-positive values, the Box-Cox method cannot be applied to the data as is.
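
The count above does not say where the non-positive values are. In the earlier summary() only V4 has a minimum of 0, so both of them must be zeros in V4. A sketch that reports the count per column, shown on a made-up data frame with hypothetical column names:

```r
# Made-up data: two zero heights, all weights positive
toy <- data.frame(height = c(0.095, 0, 0.135, 0),
                  weight = c(0.514, 0.226, 0.677, 0.516))

# Number of non-positive values in each column
non_positive <- colSums(toy <= 0)
non_positive

# Columns that would need attention before a Box-Cox transform
names(non_positive)[non_positive > 0]
```

Knowing the affected column leaves two options: treat the zero heights as data errors and handle those rows, or use a transform that accepts zeros, such as Yeo-Johnson.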

Yeo-Johnson Transform

# load packages
library(mlbench)
library(caret)
# calculate the pre-process parameters from the dataset
abalone_YeoJohnson_Params <- preProcess(abalone_data[,2:9], method= c('YeoJohnson'))
# summarize transform parameters
print(abalone_YeoJohnson_Params)

## Created from 4177 samples and 6 variables
## 
## Pre-processing:
##   - ignored (0)
##   - Yeo-Johnson transformation (6)
## 
## Lambda estimates for Yeo-Johnson transformation:
## -1.7, 0.06, -0.81, -1.8, -1.14, 0.05

# transform the dataset using the parameters
abalone_YeoJohnson <- predict(abalone_YeoJohnson_Params, abalone_data[,2:9])
# summarize the transformed dataset
summary(abalone_YeoJohnson)

##        V2              V3               V4                V5          
##  Min.   :0.075   Min.   :0.0550   Min.   :0.00000   Min.   :0.001998  
##  1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.09937   1st Qu.:0.370047  
##  Median :0.545   Median :0.4250   Median :0.11746   Median :0.598825  
##  Mean   :0.524   Mean   :0.4079   Mean   :0.11575   Mean   :0.580887  
##  3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.13450   3rd Qu.:0.786217  
##  Max.   :0.815   Max.   :0.6500   Max.   :0.42551   Max.   :1.401684  
##        V6                  V7                  V8                 V9       
##  Min.   :0.0009991   Min.   :0.0004997   Min.   :0.001498   Min.   :0.706  
##  1st Qu.:0.1593278   1st Qu.:0.0825799   1st Qu.:0.114075   1st Qu.:2.330  
##  Median :0.2582102   Median :0.1374588   Median :0.186930   Median :2.449  
##  Mean   :0.2538275   Mean   :0.1349303   Mean   :0.179761   Mean   :2.506  
##  3rd Qu.:0.3465808   3rd Qu.:0.1854611   3rd Qu.:0.242868   3rd Qu.:2.656  
##  Max.   :0.6445799   Max.   :0.3551130   Max.   :0.480061   Max.   :3.726
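
In the output above, caret estimated a lambda for six of the eight columns; V2 and V3 keep their original summaries, so the six lambdas belong to V4 through V9 in order. To make the transform itself concrete, here is a sketch of the standard Yeo-Johnson formula in base R. The function name yeo_johnson is my own, and this is not caret's internal code:

```r
# Yeo-Johnson transform of a numeric vector y for a given lambda.
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values: Box-Cox applied to y + 1
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log(y[pos] + 1)
  }
  # Negative values: mirrored Box-Cox with exponent 2 - lambda
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((-y[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log(-y[!pos] + 1)
  }
  out
}

# Raw maximum of V4 (1.13) with the first printed lambda (-1.7)
yeo_johnson(1.13, -1.7)
```

The result should land close to the transformed maximum of V4 reported above (0.42551); a small difference is expected because the printed lambda is rounded.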

Principal Component Analysis Transform

# load packages
library(caret)
# calculate the pre-process parameters from the dataset
abalone_PCA_Params <- preProcess(abalone_data[,2:9], method= c('center','scale','pca'))
# summarize transform parameters
print(abalone_PCA_Params)

## Created from 4177 samples and 8 variables
## 
## Pre-processing:
##   - centered (8)
##   - ignored (0)
##   - principal component signal extraction (8)
##   - scaled (8)
## 
## PCA needed 3 components to capture 95 percent of the variance

# transform the dataset using the parameters
abalone_PCA <- predict(abalone_PCA_Params, abalone_data[,2:9])
# summarize the transformed dataset
summary(abalone_PCA)

##       PC1               PC2               PC3            
##  Min.   :-6.9040   Min.   :-4.9008   Min.   :-21.743927  
##  1st Qu.:-1.8938   1st Qu.:-0.3200   1st Qu.: -0.210454  
##  Median : 0.1279   Median : 0.1932   Median : -0.008906  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   :  0.000000  
##  3rd Qu.: 1.8451   3rd Qu.: 0.5185   3rd Qu.:  0.204586  
##  Max.   : 8.6382   Max.   : 2.4622   Max.   :  3.125645
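
caret reports how many components are needed to capture 95 percent of the variance (0.95 is the default of the thresh argument of preProcess). The same idea can be sketched with prcomp from base R. The helper name n_components is hypothetical, the built-in iris data is used only so that the example runs on its own, and the sketch is not caret's own implementation:

```r
# Number of principal components needed to reach a variance threshold.
n_components <- function(df, thresh = 0.95) {
  pca <- prcomp(df, center = TRUE, scale. = TRUE)   # standardize, then PCA
  cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)   # cumulative variance share
  which(cum_var >= thresh)[1]
}

n_components(iris[, 1:4])
```

For the abalone data the call would be n_components(abalone_data[,2:9]).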

Independent Component Analysis Transform

# install packages
# install.packages('fastICA')
# load packages
library(mlbench)
library(caret)
# calculate the pre-process parameters from the dataset
abalone_ICA_Params <- preProcess(abalone_data[,2:9], method= c('center','scale','ica'),n.comp= 5)
# summarize transform parameters
print(abalone_ICA_Params)

## Created from 4177 samples and 8 variables
## 
## Pre-processing:
##   - centered (8)
##   - independent component signal extraction (8)
##   - ignored (0)
##   - scaled (8)
## 
## ICA used 5 components

# transform the dataset using the parameters
abalone_ICA <- predict(abalone_ICA_Params, abalone_data[,2:9])
# summarize the transformed dataset
summary(abalone_ICA)

##       ICA1              ICA2               ICA3              ICA4         
##  Min.   :-7.8012   Min.   :-10.9249   Min.   :-2.4570   Min.   :-6.75129  
##  1st Qu.:-0.4419   1st Qu.: -0.3061   1st Qu.:-0.6322   1st Qu.:-0.40339  
##  Median : 0.2806   Median :  0.2197   Median :-0.2409   Median :-0.04277  
##  Mean   : 0.0000   Mean   :  0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.6788   3rd Qu.:  0.5445   3rd Qu.: 0.3571   3rd Qu.: 0.35159  
##  Max.   : 1.8939   Max.   :  4.2732   Max.   : 6.4480   Max.   :45.80818  
##       ICA5        
##  Min.   :-2.6445  
##  1st Qu.:-0.6300  
##  Median :-0.2608  
##  Mean   : 0.0000  
##  3rd Qu.: 0.3217  
##  Max.   : 5.7875

Data Visualization

To visualize the data, I used a ggpairs plot (from the GGally package), because it combines distributions, scatter plots, box plots and correlations in a single figure.

library(GGally)
ggpairs(abalone_data, columns = 1:8)

Data Interpretation

The figure above summarizes the first eight variables (V1 to V8); the target V9 (rings) is not included in the plot.
The plots on the diagonal show the distribution of each variable.
Because V1 is a factor with three levels, it is shown as a bar plot.
The correlations above the diagonal again indicate that V2 and V3 are highly correlated (0.987).
The scatter plot of these two variables below the diagonal confirms this: the points fall almost on a straight line. It may therefore be enough to keep only one of these variables in a later analysis.
In general, all the numeric variables V2 to V8 are strongly correlated with each other. The smallest correlation is 0.775 (between V4 and V6).
The first row of the figure contains box plots, which show the median and the 1st and 3rd quartiles of each variable. Because V1 has three levels, the box plots are drawn for each level separately to simplify the interpretation. For example, for most of the variables, the second level (I) has a lower median.
The same pattern can be seen in the first column of the figure, which shows a histogram of each variable for each of the three levels of sex.
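
The statement about medians can also be checked numerically instead of being read off the box plots. A minimal sketch with aggregate from base R, on a made-up data frame; the column names and values are illustrative only:

```r
# Made-up measurements for the three levels of sex
toy <- data.frame(
  sex    = factor(c("F", "F", "I", "I", "M", "M")),
  length = c(0.53, 0.55, 0.33, 0.43, 0.45, 0.47),
  rings  = c(9, 19, 7, 8, 15, 10)
)

# One row per level of sex, one median per numeric column
medians <- aggregate(. ~ sex, data = toy, FUN = median)
medians
```

On the real data the formula would be . ~ V1 with data = abalone_data.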

Wheat Seed Dataset

This dataset describes kernels of three types of wheat: Kama, Rosa and Canadian (variable 8 gives the type). 70 kernels of each type were randomly selected for the experiment. To construct the data, seven geometric parameters of the wheat kernels were measured:

area A,
perimeter P,
compactness C = 4*pi*A/P^2,
length of kernel,
width of kernel,
asymmetry coefficient,
length of kernel groove.
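
Compactness is the only derived feature in the list: it is computed from area and perimeter and reaches its maximum of 1 for a circle. A short sketch of the formula, with a circle as a sanity check; the function name compactness is my own:

```r
# Compactness of a shape from its area A and perimeter P.
compactness <- function(A, P) 4 * pi * A / P^2

# Sanity check on a circle of radius 2, where the value should be 1
r <- 2
compactness(pi * r^2, 2 * pi * r)

# A unit square is less compact than a circle (pi / 4)
compactness(1, 4)
```

Once the data is loaded, comparing compactness(V1, V2) with V3 is a quick consistency check on the file; the two should agree up to the rounding of the stored values.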

More about the Wheat Seeds dataset

seeds_data <- read.table("seeds/seeds_dataset.txt")
seeds_data[,8] <- as.factor(seeds_data[,8])
cat('Below is the size and structure of the data and type of each variable:\n')

## Below is the size and structure of the data and type of each variable:

str(seeds_data)

## 'data.frame':    210 obs. of  8 variables:
##  $ V1: num  15.3 14.9 14.3 13.8 16.1 ...
##  $ V2: num  14.8 14.6 14.1 13.9 15 ...
##  $ V3: num  0.871 0.881 0.905 0.895 0.903 ...
##  $ V4: num  5.76 5.55 5.29 5.32 5.66 ...
##  $ V5: num  3.31 3.33 3.34 3.38 3.56 ...
##  $ V6: num  2.22 1.02 2.7 2.26 1.35 ...
##  $ V7: num  5.22 4.96 4.83 4.8 5.17 ...
##  $ V8: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...

cat("Number of missing values: ", sum(is.na(seeds_data)),'\n')

## Number of missing values:  0

cat('Below is the summary of key statistics of the data:\n')

## Below is the summary of key statistics of the data:

summary(seeds_data)

##        V1              V2              V3               V4       
##  Min.   :10.59   Min.   :12.41   Min.   :0.8081   Min.   :4.899  
##  1st Qu.:12.27   1st Qu.:13.45   1st Qu.:0.8569   1st Qu.:5.262  
##  Median :14.36   Median :14.32   Median :0.8734   Median :5.524  
##  Mean   :14.85   Mean   :14.56   Mean   :0.8710   Mean   :5.629  
##  3rd Qu.:17.30   3rd Qu.:15.71   3rd Qu.:0.8878   3rd Qu.:5.980  
##  Max.   :21.18   Max.   :17.25   Max.   :0.9183   Max.   :6.675  
##        V5              V6               V7        V8    
##  Min.   :2.630   Min.   :0.7651   Min.   :4.519   1:70  
##  1st Qu.:2.944   1st Qu.:2.5615   1st Qu.:5.045   2:70  
##  Median :3.237   Median :3.5990   Median :5.223   3:70  
##  Mean   :3.259   Mean   :3.7002   Mean   :5.408         
##  3rd Qu.:3.562   3rd Qu.:4.7687   3rd Qu.:5.877         
##  Max.   :4.033   Max.   :8.4560   Max.   :6.550

Check correlations

cor(seeds_data[,1:7])

##            V1         V2         V3         V4         V5          V6
## V1  1.0000000  0.9943409  0.6082884  0.9499854  0.9707706 -0.22957233
## V2  0.9943409  1.0000000  0.5292436  0.9724223  0.9448294 -0.21734037
## V3  0.6082884  0.5292436  1.0000000  0.3679151  0.7616345 -0.33147087
## V4  0.9499854  0.9724223  0.3679151  1.0000000  0.8604149 -0.17156243
## V5  0.9707706  0.9448294  0.7616345  0.8604149  1.0000000 -0.25803655
## V6 -0.2295723 -0.2173404 -0.3314709 -0.1715624 -0.2580365  1.00000000
## V7  0.8636927  0.8907839  0.2268248  0.9328061  0.7491315 -0.01107902
##             V7
## V1  0.86369275
## V2  0.89078390
## V3  0.22682482
## V4  0.93280609
## V5  0.74913147
## V6 -0.01107902
## V7  1.00000000

Based on the correlation matrix, V1 (area) and V2 (perimeter) have the highest correlation, about 0.994.

Scale Data

library(caret)
# calculate the pre-process parameters from the dataset
seeds_scale_Params <- preProcess(seeds_data[,1:7], method= c('scale'))
# summarize transform parameters
print(seeds_scale_Params)

## Created from 210 samples and 7 variables
## 
## Pre-processing:
##   - ignored (0)
##   - scaled (7)

# transform the dataset using the parameters
seeds_scaled <- predict(seeds_scale_Params, seeds_data[,1:7])
summary(seeds_scaled)

##        V1              V2               V3              V4       
##  Min.   :3.640   Min.   : 9.503   Min.   :34.20   Min.   :11.06  
##  1st Qu.:4.217   1st Qu.:10.299   1st Qu.:36.26   1st Qu.:11.88  
##  Median :4.933   Median :10.965   Median :36.96   Median :12.47  
##  Mean   :5.103   Mean   :11.148   Mean   :36.86   Mean   :12.70  
##  3rd Qu.:5.947   3rd Qu.:12.033   3rd Qu.:37.57   3rd Qu.:13.50  
##  Max.   :7.279   Max.   :13.209   Max.   :38.86   Max.   :15.07  
##        V5               V6               V7        
##  Min.   : 6.963   Min.   :0.5089   Min.   : 9.195  
##  1st Qu.: 7.794   1st Qu.:1.7036   1st Qu.:10.265  
##  Median : 8.570   Median :2.3937   Median :10.627  
##  Mean   : 8.627   Mean   :2.4610   Mean   :11.004  
##  3rd Qu.: 9.430   3rd Qu.:3.1716   3rd Qu.:11.958  
##  Max.   :10.677   Max.   :5.6240   Max.   :13.327

Center Data

# load packages
library(caret)
# calculate the pre-process parameters from the dataset
seed_center_Params <- preProcess(seeds_data[,1:7], method= c('center'))
# summarize transform parameters
print(seed_center_Params)

## Created from 210 samples and 7 variables
## 
## Pre-processing:
##   - centered (7)
##   - ignored (0)

# transform the dataset using the parameters
seed_centered <- predict(seed_center_Params, seeds_data[,1:7])
summary(seed_centered)

##        V1                V2                V3                  V4         
##  Min.   :-4.2575   Min.   :-2.1493   Min.   :-0.062899   Min.   :-0.7295  
##  1st Qu.:-2.5775   1st Qu.:-1.1093   1st Qu.:-0.014099   1st Qu.:-0.3663  
##  Median :-0.4925   Median :-0.2393   Median : 0.002451   Median :-0.1050  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000  
##  3rd Qu.: 2.4575   3rd Qu.: 1.1557   3rd Qu.: 0.016776   3rd Qu.: 0.3512  
##  Max.   : 6.3325   Max.   : 2.6907   Max.   : 0.047301   Max.   : 1.0465  
##        V5                V6                V7         
##  Min.   :-0.6286   Min.   :-2.9351   Min.   :-0.8891  
##  1st Qu.:-0.3146   1st Qu.:-1.1387   1st Qu.:-0.3631  
##  Median :-0.0216   Median :-0.1012   Median :-0.1851  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3031   3rd Qu.: 1.0685   3rd Qu.: 0.4689  
##  Max.   : 0.7744   Max.   : 4.7558   Max.   : 1.1419

Standardize Data

# load packages
library(caret)
# calculate the pre-process parameters from the dataset
seed_standardize_Params <- preProcess(seeds_data[,1:7], method= c('center','scale'))
# summarize transform parameters
print(seed_standardize_Params)

## Created from 210 samples and 7 variables
## 
## Pre-processing:
##   - centered (7)
##   - ignored (0)
##   - scaled (7)

# transform the dataset using the parameters
seed_standardized <- predict(seed_standardize_Params, seeds_data[,1:7])
summary(seed_standardized)

##        V1                V2                V3                V4         
##  Min.   :-1.4632   Min.   :-1.6458   Min.   :-2.6619   Min.   :-1.6466  
##  1st Qu.:-0.8858   1st Qu.:-0.8494   1st Qu.:-0.5967   1st Qu.:-0.8267  
##  Median :-0.1693   Median :-0.1832   Median : 0.1037   Median :-0.2371  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.8446   3rd Qu.: 0.8850   3rd Qu.: 0.7100   3rd Qu.: 0.7927  
##  Max.   : 2.1763   Max.   : 2.0603   Max.   : 2.0018   Max.   : 2.3619  
##        V5                V6                 V7         
##  Min.   :-1.6642   Min.   :-1.95210   Min.   :-1.8090  
##  1st Qu.:-0.8329   1st Qu.:-0.75734   1st Qu.:-0.7387  
##  Median :-0.0572   Median :-0.06731   Median :-0.3766  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.8026   3rd Qu.: 0.71068   3rd Qu.: 0.9541  
##  Max.   : 2.0502   Max.   : 3.16303   Max.   : 2.3234

Normalize Data

library(caret)
# calculate the pre-process parameters from the dataset
seed_normalize_Params <- preProcess(seeds_data[,1:7], method= c('range'))
# summarize transform parameters
print(seed_normalize_Params)

## Created from 210 samples and 7 variables
## 
## Pre-processing:
##   - ignored (0)
##   - re-scaling to [0, 1] (7)

# transform the dataset using the parameters
seed_normalized <- predict(seed_normalize_Params, seeds_data[,1:7])
summary(seed_normalized)

##        V1               V2               V3               V4        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1586   1st Qu.:0.2149   1st Qu.:0.4428   1st Qu.:0.2045  
##  Median :0.3555   Median :0.3946   Median :0.5930   Median :0.3516  
##  Mean   :0.4020   Mean   :0.4441   Mean   :0.5708   Mean   :0.4108  
##  3rd Qu.:0.6341   3rd Qu.:0.6829   3rd Qu.:0.7230   3rd Qu.:0.6085  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##        V5               V6               V7        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2238   1st Qu.:0.2336   1st Qu.:0.2590  
##  Median :0.4326   Median :0.3685   Median :0.3466  
##  Mean   :0.4480   Mean   :0.3816   Mean   :0.4378  
##  3rd Qu.:0.6641   3rd Qu.:0.5206   3rd Qu.:0.6686  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

Box-Cox Transform

The Box-Cox method can only be applied to strictly positive values, so we first check whether the dataset contains any non-positive values.

# load packages
library(mlbench)
library(caret)
# Count the non-positive values in the numeric columns
cat('The data has', sum(seeds_data[,1:7] <= 0), 'non-positive values.\n')

## The data has 0 non-positive values.

# calculate the pre-process parameters from the dataset
seed_boxcox_Params <- preProcess(seeds_data[,1:7], method= c('BoxCox'))
# summarize transform parameters
print(seed_boxcox_Params)

## Created from 210 samples and 7 variables
## 
## Pre-processing:
##   - Box-Cox transformation (7)
##   - ignored (0)
## 
## Lambda estimates for Box-Cox transformation:
## -0.6, -2, 2, -2, 0.1, 0.6, -2

# transform the dataset using the parameters
seed_boxcox <- predict(seed_boxcox_Params, seeds_data[,1:7])
# summarize the transformed dataset
summary(seed_boxcox)

##        V1              V2               V3                 V4        
##  Min.   :1.262   Min.   :0.4968   Min.   :-0.17349   Min.   :0.4792  
##  1st Qu.:1.296   1st Qu.:0.4972   1st Qu.:-0.13286   1st Qu.:0.4819  
##  Median :1.330   Median :0.4976   Median :-0.11854   Median :0.4836  
##  Mean   :1.330   Mean   :0.4976   Mean   :-0.12040   Mean   :0.4839  
##  3rd Qu.:1.365   3rd Qu.:0.4980   3rd Qu.:-0.10593   3rd Qu.:0.4860  
##  Max.   :1.400   Max.   :0.4983   Max.   :-0.07836   Max.   :0.4888  
##        V5              V6                V7        
##  Min.   :0.967   Min.   :-0.2473   Min.   :0.4755  
##  1st Qu.:1.080   1st Qu.: 1.2638   1st Qu.:0.4804  
##  Median :1.175   Median : 1.9272   Median :0.4817  
##  Mean   :1.175   Mean   : 1.9116   Mean   :0.4825  
##  3rd Qu.:1.270   3rd Qu.: 2.5883   3rd Qu.:0.4855  
##  Max.   :1.395   Max.   : 4.3333   Max.   :0.4883
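
The transformed values can be traced back to the Box-Cox formula, (x^lambda - 1) / lambda, with log(x) used when lambda is 0. A base R sketch; the function name box_cox is my own, and it is not caret's internal code. The two calls use raw minima from the earlier summary() together with the lambdas printed above for V3 and V2:

```r
# Box-Cox transform of strictly positive values for a given lambda.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

box_cox(0.8081, 2)    # raw minimum of V3, lambda = 2
box_cox(12.41, -2)    # raw minimum of V2, lambda = -2
```

The results should match the minima of V3 and V2 in the summary above (-0.17349 and 0.4968) up to rounding.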

Yeo-Johnson Transform

# load packages
library(mlbench)
library(caret)
# calculate the pre-process parameters from the dataset
seed_YeoJohnson_Params <- preProcess(seeds_data[,1:7], method= c('YeoJohnson'))
# summarize transform parameters
print(seed_YeoJohnson_Params)

## Created from 210 samples and 4 variables
## 
## Pre-processing:
##   - ignored (0)
##   - Yeo-Johnson transformation (4)
## 
## Lambda estimates for Yeo-Johnson transformation:
## -0.7, -2.28, -0.12, 0.46

# transform the dataset using the parameters
seed_YeoJohnson <- predict(seed_YeoJohnson_Params, seeds_data[,1:7])
# summarize the transformed dataset
summary(seed_YeoJohnson)

##        V1              V2               V3               V4       
##  Min.   :1.175   Min.   :0.4367   Min.   :0.8081   Min.   :4.899  
##  1st Qu.:1.198   1st Qu.:0.4369   1st Qu.:0.8569   1st Qu.:5.262  
##  Median :1.221   Median :0.4370   Median :0.8734   Median :5.524  
##  Mean   :1.222   Mean   :0.4370   Mean   :0.8710   Mean   :5.629  
##  3rd Qu.:1.246   3rd Qu.:0.4372   3rd Qu.:0.8878   3rd Qu.:5.980  
##  Max.   :1.270   Max.   :0.4373   Max.   :0.9183   Max.   :6.675  
##        V5              V6               V7       
##  Min.   :1.197   Min.   :0.6487   Min.   :4.519  
##  1st Qu.:1.268   1st Qu.:1.7211   1st Qu.:5.045  
##  Median :1.329   Median :2.2053   Median :5.223  
##  Mean   :1.329   Mean   :2.1918   Mean   :5.408  
##  3rd Qu.:1.391   3rd Qu.:2.6844   3rd Qu.:5.877  
##  Max.   :1.473   Max.   :3.9179   Max.   :6.550

Principal Component Analysis Transform

# load packages
library(caret)
# calculate the pre-process parameters from the dataset
seed_PCA_Params <- preProcess(seeds_data[,1:7], method= c('center','scale','pca'))
# summarize transform parameters
print(seed_PCA_Params)

## Created from 210 samples and 7 variables
## 
## Pre-processing:
##   - centered (7)
##   - ignored (0)
##   - principal component signal extraction (7)
##   - scaled (7)
## 
## PCA needed 3 components to capture 95 percent of the variance

# transform the dataset using the parameters
seed_PCA <- predict(seed_PCA_Params, seeds_data[,1:7])
# summarize the transformed dataset
summary(seed_PCA)

##       PC1               PC2               PC3          
##  Min.   :-4.4650   Min.   :-2.5342   Min.   :-1.60539  
##  1st Qu.:-1.8219   1st Qu.:-0.8087   1st Qu.:-0.58036  
##  Median : 0.3508   Median :-0.1274   Median :-0.04621  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 1.9078   3rd Qu.: 0.7861   3rd Qu.: 0.47846  
##  Max.   : 3.2911   Max.   : 2.7811   Max.   : 2.91454

Independent Component Analysis Transform

# install packages
# install.packages('fastICA')
# load packages
library(mlbench)
library(caret)
# calculate the pre-process parameters from the dataset
seed_ICA_Params <- preProcess(seeds_data[,1:7], method= c('center','scale','ica'),n.comp= 5)
# summarize transform parameters
print(seed_ICA_Params)

## Created from 210 samples and 7 variables
## 
## Pre-processing:
##   - centered (7)
##   - independent component signal extraction (7)
##   - ignored (0)
##   - scaled (7)
## 
## ICA used 5 components

# transform the dataset using the parameters
seed_ICA <- predict(seed_ICA_Params, seeds_data[,1:7])
# summarize the transformed dataset
summary(seed_ICA)

##       ICA1               ICA2              ICA3               ICA4        
##  Min.   :-3.37294   Min.   :-1.8690   Min.   :-2.26399   Min.   :-2.0022  
##  1st Qu.:-0.46181   1st Qu.:-0.9819   1st Qu.:-0.68141   1st Qu.:-0.8353  
##  Median : 0.06536   Median : 0.1982   Median : 0.03591   Median :-0.1338  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.63044   3rd Qu.: 0.8119   3rd Qu.: 0.68949   3rd Qu.: 0.8512  
##  Max.   : 2.21662   Max.   : 1.8839   Max.   : 2.88302   Max.   : 2.5194  
##       ICA5          
##  Min.   :-3.303668  
##  1st Qu.:-0.499120  
##  Median :-0.006891  
##  Mean   : 0.000000  
##  3rd Qu.: 0.583083  
##  Max.   : 5.158864

Data Visualization

To visualize the data, I used a ggpairs plot (from the GGally package), because it combines distributions, scatter plots, box plots and correlations in a single figure.

library(GGally)
ggpairs(seeds_data, columns = 1:8)

Data Interpretation

The figure above summarizes all eight variables.
The plots on the diagonal show the distribution of each variable.
Because V8 is a factor, it is shown as a bar plot. The three bars have the same height because there are 70 observations of each type of wheat seed.
The correlations above the diagonal again indicate that V1 and V2 are highly correlated (0.994).
The scatter plot of these two variables below the diagonal confirms this: the points fall almost on a straight line, which indicates a linear relationship between the two variables. It may therefore be enough to keep only one of these variables in a later analysis.
V4 and V5 are also strongly correlated with V1 and V2 (correlations between 0.945 and 0.972).
V3 and V6 have the weakest correlations with the other variables, as both their correlation values and their scatter plots show.
The last column of the figure contains box plots, which show the median and the 1st and 3rd quartiles of each variable. Because V8 has three levels, the box plots are drawn for each level separately to simplify the interpretation. For example, for most of the variables, type 2 has a higher median.
The same pattern can be seen in the bottom row of the figure, which shows a histogram of each variable for each of the three types of wheat seed.
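
The suggestion above to keep only one of two highly correlated variables can be turned into a simple filter. A rough sketch in base R under my own assumptions: the function name drop_correlated, the 0.95 cutoff and the rule for choosing which column to drop are illustrative choices, not something the analysis above prescribes. caret also provides a findCorrelation function for this purpose.

```r
# Repeatedly drop one column of the most correlated pair until no pair of
# remaining columns has an absolute correlation above the cutoff.
drop_correlated <- function(df, cutoff = 0.95) {
  cm <- abs(cor(df))
  diag(cm) <- 0
  keep <- colnames(df)
  repeat {
    sub <- cm[keep, keep, drop = FALSE]
    if (max(sub) <= cutoff) break
    pair <- which(sub == max(sub), arr.ind = TRUE)[1, ]
    cand <- keep[pair]
    # Of the two, drop the one that is on average more correlated with the rest
    worse <- cand[which.max(rowMeans(sub[cand, , drop = FALSE]))]
    keep <- setdiff(keep, worse)
  }
  df[, keep, drop = FALSE]
}

# Made-up data: b is an exact linear function of a, c is an unrelated sequence
toy <- data.frame(a = 1:10,
                  b = 2 * (1:10) + 1,
                  c = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))
names(drop_correlated(toy))
```

On the seeds data the call would be drop_correlated(seeds_data[,1:7]); which columns survive depends on the cutoff.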

2024-08-20
