Directions: There are 10 very popular standard datasets for practicing data engineering and mining, and applied machine learning. These datasets are well researched, well studied and well used, and they have been studied for quite a while. Please go to the internet and find at least two of these datasets. The datasets are the following:
1. Swedish Auto Insurance Dataset
2. Wine Quality Dataset
3. Pima Indians Diabetes Dataset
4. Sonar Dataset
5. Banknote Dataset
6. Iris Flowers Dataset
7. Abalone Dataset
8. Ionosphere Dataset
9. Wheat Seed Dataset
10. Boston House Price Dataset
rm(list = ls())
This dataset is about predicting the age of abalone from physical measurements. The age of abalone is determined by cutting its shell through the cone, staining it, and counting the number of rings through a microscope. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.
I downloaded this dataset from the UCI Machine Learning Repository. The dataset has no missing values. The continuous values have also been scaled for use with an ANN (by dividing by 200). The data has 8 features of abalone:
sex
length
diameter
height
whole weight
shucked weight
viscera weight
shell weight
and its rings as its dependent/target variable.
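Because the raw file has no header row, R names the columns V1 to V9 when it is read. A minimal sketch of attaching the descriptive names from the list above is shown here. The snake_case names and the `named_sample` object are illustrative choices, not part of the original analysis, and the three sample rows are copied from the first records as printed by `str()` in this report (so the V5 values are rounded).

```r
# Hypothetical descriptive names for V1..V9, in the order listed above.
abalone_names <- c("sex", "length", "diameter", "height",
                   "whole_weight", "shucked_weight",
                   "viscera_weight", "shell_weight", "rings")

# Three sample rows standing in for
# read.table("abalone/abalone.data", sep = ",").
abalone_sample <- data.frame(
  V1 = factor(c("M", "M", "F")),
  V2 = c(0.455, 0.350, 0.530),
  V3 = c(0.365, 0.265, 0.420),
  V4 = c(0.095, 0.090, 0.135),
  V5 = c(0.514, 0.226, 0.677),   # rounded as displayed by str()
  V6 = c(0.2245, 0.0995, 0.2565),
  V7 = c(0.1010, 0.0485, 0.1415),
  V8 = c(0.150, 0.070, 0.210),
  V9 = c(15L, 7L, 9L)
)

# Work on a renamed copy so the V1..V9 names used in the rest of the report stay valid.
named_sample <- setNames(abalone_sample, abalone_names)
str(named_sample)
```

The rest of this report keeps the default V1 to V9 names, so the mapping is only a reading aid.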
abalone_data <- read.table("abalone/abalone.data", sep = ',')
abalone_data[,1] <- as.factor(abalone_data[,1])
print('Below is the size and structure of the data and type of each variable:\n')
## [1] "Below is the size and structure of the data and type of each variable:\n"
str(abalone_data)
## 'data.frame': 4177 obs. of 9 variables:
## $ V1: Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
## $ V2: num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
## $ V3: num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
## $ V4: num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
## $ V5: num 0.514 0.226 0.677 0.516 0.205 ...
## $ V6: num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
## $ V7: num 0.101 0.0485 0.1415 0.114 0.0395 ...
## $ V8: num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
## $ V9: int 15 7 9 10 7 8 20 16 9 19 ...
print('The sex variable is a factor with three levels.\n')
## [1] "The sex variable is a factor with three levels.\n"
cat("Number of missing values: ", sum(is.na(abalone_data)),'\n')
## Number of missing values: 0
print('Below is the summary of key statistics of the data:\n')
## [1] "Below is the summary of key statistics of the data:\n"
summary(abalone_data)
## V1 V2 V3 V4 V5
## F:1307 Min. :0.075 Min. :0.0550 Min. :0.0000 Min. :0.0020
## I:1342 1st Qu.:0.450 1st Qu.:0.3500 1st Qu.:0.1150 1st Qu.:0.4415
## M:1528 Median :0.545 Median :0.4250 Median :0.1400 Median :0.7995
## Mean :0.524 Mean :0.4079 Mean :0.1395 Mean :0.8287
## 3rd Qu.:0.615 3rd Qu.:0.4800 3rd Qu.:0.1650 3rd Qu.:1.1530
## Max. :0.815 Max. :0.6500 Max. :1.1300 Max. :2.8255
## V6 V7 V8 V9
## Min. :0.0010 Min. :0.0005 Min. :0.0015 Min. : 1.000
## 1st Qu.:0.1860 1st Qu.:0.0935 1st Qu.:0.1300 1st Qu.: 8.000
## Median :0.3360 Median :0.1710 Median :0.2340 Median : 9.000
## Mean :0.3594 Mean :0.1806 Mean :0.2388 Mean : 9.934
## 3rd Qu.:0.5020 3rd Qu.:0.2530 3rd Qu.:0.3290 3rd Qu.:11.000
## Max. :1.4880 Max. :0.7600 Max. :1.0050 Max. :29.000
cor(abalone_data[,2:9])
## V2 V3 V4 V5 V6 V7 V8
## V2 1.0000000 0.9868116 0.8275536 0.9252612 0.8979137 0.9030177 0.8977056
## V3 0.9868116 1.0000000 0.8336837 0.9254521 0.8931625 0.8997244 0.9053298
## V4 0.8275536 0.8336837 1.0000000 0.8192208 0.7749723 0.7983193 0.8173380
## V5 0.9252612 0.9254521 0.8192208 1.0000000 0.9694055 0.9663751 0.9553554
## V6 0.8979137 0.8931625 0.7749723 0.9694055 1.0000000 0.9319613 0.8826171
## V7 0.9030177 0.8997244 0.7983193 0.9663751 0.9319613 1.0000000 0.9076563
## V8 0.8977056 0.9053298 0.8173380 0.9553554 0.8826171 0.9076563 1.0000000
## V9 0.5567196 0.5746599 0.5574673 0.5403897 0.4208837 0.5038192 0.6275740
## V9
## V2 0.5567196
## V3 0.5746599
## V4 0.5574673
## V5 0.5403897
## V6 0.4208837
## V7 0.5038192
## V8 0.6275740
## V9 1.0000000
Based on the correlation matrix, variables V2 and V3 (length and diameter) have the highest correlation, at about 0.987.
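Reading the largest value off a correlation matrix by eye is error-prone, so the lookup can also be done in code. The helper below, `top_correlated_pair`, is a hypothetical name introduced for this sketch, and the toy data frame is synthetic; only base R functions are used.

```r
# Return the pair of columns with the largest absolute correlation.
top_correlated_pair <- function(df) {
  cm <- cor(df)
  # Blank out the diagonal and upper triangle so each pair is considered once.
  cm[upper.tri(cm, diag = TRUE)] <- NA
  pos <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)[1, ]
  list(vars = c(colnames(cm)[pos["col"]], rownames(cm)[pos["row"]]),
       r = cm[pos["row"], pos["col"]])
}

# Synthetic example: b is almost a multiple of a, c is unrelated noise.
set.seed(1)
x <- rnorm(50)
toy <- data.frame(a = x, b = 2 * x + rnorm(50, sd = 0.01), c = rnorm(50))
best <- top_correlated_pair(toy)
print(best)
```

Called on the numeric abalone columns, for example `top_correlated_pair(abalone_data[, 2:9])`, it should point at the same pair as the matrix above.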
library(caret)
# calculate the pre-process parameters from the dataset
abalone_scale_Params <- preProcess(abalone_data[,2:9], method= c('scale'))
# summarize transform parameters
print(abalone_scale_Params)
## Created from 4177 samples and 8 variables
##
## Pre-processing:
## - ignored (0)
## - scaled (8)
# transform the dataset using the parameters
abalone_scaled <- predict(abalone_scale_Params, abalone_data[,2:9])
summary(abalone_scaled)
## V2 V3 V4 V5
## Min. :0.6245 Min. :0.5542 Min. : 0.000 Min. :0.004078
## 1st Qu.:3.7471 1st Qu.:3.5268 1st Qu.: 2.749 1st Qu.:0.900306
## Median :4.5382 Median :4.2826 Median : 3.347 Median :1.630338
## Mean :4.3632 Mean :4.1101 Mean : 3.336 Mean :1.689969
## 3rd Qu.:5.1210 3rd Qu.:4.8368 3rd Qu.: 3.945 3rd Qu.:2.351195
## Max. :6.7864 Max. :6.5498 Max. :27.016 Max. :5.761752
## V6 V7 V8 V9
## Min. :0.004505 Min. :0.004561 Min. :0.01078 Min. :0.3102
## 1st Qu.:0.837978 1st Qu.:0.852991 1st Qu.:0.93389 1st Qu.:2.4813
## Median :1.513766 Median :1.560016 Median :1.68100 Median :2.7914
## Mean :1.619043 Mean :1.647538 Mean :1.71571 Mean :3.0810
## 3rd Qu.:2.261639 3rd Qu.:2.308094 3rd Qu.:2.36346 3rd Qu.:3.4117
## Max. :6.703822 Max. :6.933405 Max. :7.21969 Max. :8.9946
# load packages
library(caret)
# calculate the pre-process parameters from the dataset
abalone_center_Params <- preProcess(abalone_data[,2:9], method= c('center'))
# summarize transform parameters
print(abalone_center_Params)
## Created from 4177 samples and 8 variables
##
## Pre-processing:
## - centered (8)
## - ignored (0)
# transform the dataset using the parameters
abalone_centered <- predict(abalone_center_Params, abalone_data[,2:9])
summary(abalone_centered)
## V2 V3 V4 V5
## Min. :-0.44899 Min. :-0.35288 Min. :-0.1395164 Min. :-0.82674
## 1st Qu.:-0.07399 1st Qu.:-0.05788 1st Qu.:-0.0245164 1st Qu.:-0.38724
## Median : 0.02101 Median : 0.01712 Median : 0.0004836 Median :-0.02924
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000000 Mean : 0.00000
## 3rd Qu.: 0.09101 3rd Qu.: 0.07212 3rd Qu.: 0.0254836 3rd Qu.: 0.32426
## Max. : 0.29101 Max. : 0.24212 Max. : 0.9904836 Max. : 1.99676
## V6 V7 V8 V9
## Min. :-0.35837 Min. :-0.180094 Min. :-0.237331 Min. :-8.9337
## 1st Qu.:-0.17337 1st Qu.:-0.087094 1st Qu.:-0.108831 1st Qu.:-1.9337
## Median :-0.02337 Median :-0.009594 Median :-0.004831 Median :-0.9337
## Mean : 0.00000 Mean : 0.000000 Mean : 0.000000 Mean : 0.0000
## 3rd Qu.: 0.14263 3rd Qu.: 0.072406 3rd Qu.: 0.090169 3rd Qu.: 1.0663
## Max. : 1.12863 Max. : 0.579406 Max. : 0.766169 Max. :19.0663
# load packages
library(caret)
# calculate the pre-process parameters from the dataset
abalone_standardize_Params <- preProcess(abalone_data[,2:9], method= c('center','scale'))
# summarize transform parameters
print(abalone_standardize_Params)
## Created from 4177 samples and 8 variables
##
## Pre-processing:
## - centered (8)
## - ignored (0)
## - scaled (8)
# transform the dataset using the parameters
abalone_standardized <- predict(abalone_standardize_Params, abalone_data[,2:9])
summary(abalone_standardized)
## V2 V3 V4 V5
## Min. :-3.7387 Min. :-3.5558 Min. :-3.33555 Min. :-1.68589
## 1st Qu.:-0.6161 1st Qu.:-0.5832 1st Qu.:-0.58614 1st Qu.:-0.78966
## Median : 0.1749 Median : 0.1725 Median : 0.01156 Median :-0.05963
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.7578 3rd Qu.: 0.7267 3rd Qu.: 0.60926 3rd Qu.: 0.66123
## Max. : 2.4232 Max. : 2.4397 Max. :23.68045 Max. : 4.07178
## V6 V7 V8 V9
## Min. :-1.6145 Min. :-1.64298 Min. :-1.7049 Min. :-2.7708
## 1st Qu.:-0.7811 1st Qu.:-0.79455 1st Qu.:-0.7818 1st Qu.:-0.5997
## Median :-0.1053 Median :-0.08752 Median :-0.0347 Median :-0.2896
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6426 3rd Qu.: 0.66056 3rd Qu.: 0.6478 3rd Qu.: 0.3307
## Max. : 5.0848 Max. : 5.28587 Max. : 5.5040 Max. : 5.9136
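The 'center' and 'scale' steps can also be reproduced by hand, which makes it clearer what the summaries above contain. The sketch below assumes that caret centers with the column mean and scales with the sample standard deviation, the same convention as base R's `scale()`. The helper name `standardize` and the toy data frame are made up for illustration.

```r
# Standardize each column: subtract the mean, divide by the sample standard deviation.
standardize <- function(df) {
  as.data.frame(lapply(df, function(x) (x - mean(x)) / sd(x)))
}

toy <- data.frame(u = c(1, 2, 3, 4, 5), v = c(10, 20, 40, 80, 160))
z <- standardize(toy)

# Base R's scale() performs the same computation, so the two should agree.
print(all.equal(as.matrix(z), scale(toy), check.attributes = FALSE))

# After standardizing, every column has mean 0 and standard deviation 1.
print(round(colMeans(z), 10))
print(sapply(z, sd))
```

This is why every mean in the standardized summary is 0, while the scale-only summary keeps non-zero means.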
library(caret)
# calculate the pre-process parameters from the dataset
abalone_normalize_Params <- preProcess(abalone_data[,2:9], method= c('range'))
# summarize transform parameters
print(abalone_normalize_Params)
## Created from 4177 samples and 8 variables
##
## Pre-processing:
## - ignored (0)
## - re-scaling to [0, 1] (8)
# transform the dataset using the parameters
abalone_normalized <- predict(abalone_normalize_Params, abalone_data[,2:9])
summary(abalone_normalized)
## V2 V3 V4 V5
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.5068 1st Qu.:0.4958 1st Qu.:0.1018 1st Qu.:0.1557
## Median :0.6351 Median :0.6218 Median :0.1239 Median :0.2825
## Mean :0.6067 Mean :0.5931 Mean :0.1235 Mean :0.2928
## 3rd Qu.:0.7297 3rd Qu.:0.7143 3rd Qu.:0.1460 3rd Qu.:0.4077
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## V6 V7 V8 V9
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1244 1st Qu.:0.1224 1st Qu.:0.1281 1st Qu.:0.2500
## Median :0.2253 Median :0.2245 Median :0.2317 Median :0.2857
## Mean :0.2410 Mean :0.2371 Mean :0.2365 Mean :0.3191
## 3rd Qu.:0.3369 3rd Qu.:0.3325 3rd Qu.:0.3264 3rd Qu.:0.3571
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
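The 'range' method maps each column to [0, 1] with (x - min) / (max - min). A short sketch in base R follows; the function name `range01` is my own, and the input vector reuses the five-number summary of V9 (1, 8, 9, 11, 29) from the earlier `summary()` output rather than the full column.

```r
# Min-max normalization to the interval [0, 1].
range01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Minimum, quartiles, and maximum of the ring counts (V9).
rings <- c(1, 8, 9, 11, 29)
print(round(range01(rings), 4))
```

The second value is (8 - 1) / (29 - 1) = 0.25, which matches the 1st quartile reported for V9 above. Note that a constant column would divide by zero in this simple version.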
The Box-Cox method can only be applied to strictly positive values, so we first need to check whether this dataset contains any non-positive values.
# load packages
library(mlbench)
library(caret)
# Check if there is non-positive number in the dataset
cat('The data has ',sum(sign(abalone_data[,2:9]) == -1) + sum(sign(abalone_data[,2:9]) == 0 ), ' non-positive number.')
## The data has 2 non-positive number.
print('Because we have non-positive values, we cannot apply the Box-Cox method.\n')
## [1] "Because we have non-positive values, we cannot apply the Box-Cox method.\n"
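The total count does not say which variable holds the non-positive values. A per-column count answers that directly. The sketch below uses a small made-up data frame, and `count_non_positive` is a hypothetical helper name.

```r
# Count values <= 0 in each column of a numeric data frame.
count_non_positive <- function(df) colSums(df <= 0)

toy <- data.frame(height = c(0.095, 0, 0.135, 0),
                  weight = c(0.5, 0.2, 0.7, 0.3))
print(count_non_positive(toy))
```

On the real data this would be `count_non_positive(abalone_data[, 2:9])`. Since the summary above reports a minimum of 0 only for V4, the two values are presumably both in that column.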
# load packages
library(mlbench)
library(caret)
# calculate the pre-process parameters from the dataset
abalone_YeoJohnson_Params <- preProcess(abalone_data[,2:9], method= c('YeoJohnson'))
# summarize transform parameters
print(abalone_YeoJohnson_Params)
## Created from 4177 samples and 6 variables
##
## Pre-processing:
## - ignored (0)
## - Yeo-Johnson transformation (6)
##
## Lambda estimates for Yeo-Johnson transformation:
## -1.7, 0.06, -0.81, -1.8, -1.14, 0.05
# transform the dataset using the parameters
abalone_YeoJohnson <- predict(abalone_YeoJohnson_Params, abalone_data[,2:9])
# summarize the transformed dataset
summary(abalone_YeoJohnson)
## V2 V3 V4 V5
## Min. :0.075 Min. :0.0550 Min. :0.00000 Min. :0.001998
## 1st Qu.:0.450 1st Qu.:0.3500 1st Qu.:0.09937 1st Qu.:0.370047
## Median :0.545 Median :0.4250 Median :0.11746 Median :0.598825
## Mean :0.524 Mean :0.4079 Mean :0.11575 Mean :0.580887
## 3rd Qu.:0.615 3rd Qu.:0.4800 3rd Qu.:0.13450 3rd Qu.:0.786217
## Max. :0.815 Max. :0.6500 Max. :0.42551 Max. :1.401684
## V6 V7 V8 V9
## Min. :0.0009991 Min. :0.0004997 Min. :0.001498 Min. :0.706
## 1st Qu.:0.1593278 1st Qu.:0.0825799 1st Qu.:0.114075 1st Qu.:2.330
## Median :0.2582102 Median :0.1374588 Median :0.186930 Median :2.449
## Mean :0.2538275 Mean :0.1349303 Mean :0.179761 Mean :2.506
## 3rd Qu.:0.3465808 3rd Qu.:0.1854611 3rd Qu.:0.242868 3rd Qu.:2.656
## Max. :0.6445799 Max. :0.3551130 Max. :0.480061 Max. :3.726
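Unlike Box-Cox, the Yeo-Johnson transformation is defined for zero and negative values, which is why it can be used here. Below is a minimal base R sketch of the standard Yeo-Johnson formula for a fixed lambda. The function name `yeo_johnson` and the `eps` tolerance are my own choices, and the sketch does not estimate lambda the way caret does.

```r
# Yeo-Johnson transform of a numeric vector y for a given lambda.
yeo_johnson <- function(y, lambda, eps = 1e-8) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values: Box-Cox applied to (y + 1).
  if (abs(lambda) > eps) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log(y[pos] + 1)
  }
  # Negative values: mirrored form with exponent (2 - lambda).
  if (abs(lambda - 2) > eps) {
    out[!pos] <- -(((-y[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log(-y[!pos] + 1)
  }
  out
}

# Minimum (0) and maximum (1.13) of V4 with the first printed lambda (-1.7).
print(yeo_johnson(c(0, 1.13), lambda = -1.7))
```

The second value comes out near 0.4256, close to the maximum of 0.42551 reported for V4 above. The small gap is expected because the printed lambda is rounded.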
# load packages
library(caret)
# calculate the pre-process parameters from the dataset
abalone_PCA_Params <- preProcess(abalone_data[,2:9], method= c('center','scale','pca'))
# summarize transform parameters
print(abalone_PCA_Params)
## Created from 4177 samples and 8 variables
##
## Pre-processing:
## - centered (8)
## - ignored (0)
## - principal component signal extraction (8)
## - scaled (8)
##
## PCA needed 3 components to capture 95 percent of the variance
# transform the dataset using the parameters
abalone_PCA <- predict(abalone_PCA_Params, abalone_data[,2:9])
# summarize the transformed dataset
summary(abalone_PCA)
## PC1 PC2 PC3
## Min. :-6.9040 Min. :-4.9008 Min. :-21.743927
## 1st Qu.:-1.8938 1st Qu.:-0.3200 1st Qu.: -0.210454
## Median : 0.1279 Median : 0.1932 Median : -0.008906
## Mean : 0.0000 Mean : 0.0000 Mean : 0.000000
## 3rd Qu.: 1.8451 3rd Qu.: 0.5185 3rd Qu.: 0.204586
## Max. : 8.6382 Max. : 2.4622 Max. : 3.125645
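The "components needed to capture 95 percent of the variance" figure can be reproduced from the cumulative explained variance. The sketch below uses base R's `prcomp()` and assumes that caret applies its threshold (the `thresh` argument of `preProcess`, 0.95 by default) to that cumulative share. It runs on the built-in `iris` measurements as stand-in data, not on the abalone file.

```r
# PCA on centered and scaled data; iris ships with base R.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Share of variance explained by each component, and the running total.
explained <- pca$sdev^2 / sum(pca$sdev^2)
cumulative <- cumsum(explained)

# Smallest number of components whose cumulative share reaches 95%.
n_comp <- which(cumulative >= 0.95)[1]
print(round(cumulative, 4))
print(n_comp)
```

Replacing `iris[, 1:4]` with `abalone_data[, 2:9]` should give the component count reported above, under the stated assumption.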
# install packages
# install.packages('fastICA')
# load packages
library(mlbench)
library(caret)
# calculate the pre-process parameters from the dataset
abalone_ICA_Params <- preProcess(abalone_data[,2:9], method= c('center','scale','ica'),n.comp= 5)
# summarize transform parameters
print(abalone_ICA_Params)
## Created from 4177 samples and 8 variables
##
## Pre-processing:
## - centered (8)
## - independent component signal extraction (8)
## - ignored (0)
## - scaled (8)
##
## ICA used 5 components
# transform the dataset using the parameters
abalone_ICA <- predict(abalone_ICA_Params, abalone_data[,2:9])
# summarize the transformed dataset
summary(abalone_ICA)
## ICA1 ICA2 ICA3 ICA4
## Min. :-7.8012 Min. :-10.9249 Min. :-2.4570 Min. :-6.75129
## 1st Qu.:-0.4419 1st Qu.: -0.3061 1st Qu.:-0.6322 1st Qu.:-0.40339
## Median : 0.2806 Median : 0.2197 Median :-0.2409 Median :-0.04277
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6788 3rd Qu.: 0.5445 3rd Qu.: 0.3571 3rd Qu.: 0.35159
## Max. : 1.8939 Max. : 4.2732 Max. : 6.4480 Max. :45.80818
## ICA5
## Min. :-2.6445
## 1st Qu.:-0.6300
## Median :-0.2608
## Mean : 0.0000
## 3rd Qu.: 0.3217
## Max. : 5.7875
To visualize the data, I used a ggpairs plot (from the GGally package), since it combines the distributions, scatter plots, and correlations in a single figure.
library(GGally)
ggpairs(abalone_data, columns = 1:8)
The above figure provides comprehensive information about the data.
The plots on the diagonal are the distributions of each variable.
Because V1 is a factor with three levels, it is shown as a bar plot.
The correlations shown above the diagonal again indicate that V2 and V3 are highly correlated.
The scatter plot of these two variables below the diagonal confirms this: the points fall almost on a straight line. Since the two variables carry nearly the same information, it may be better to keep only one of them in the analysis.
In general, the numeric variables V2 to V8 are all strongly correlated with each other. The minimum correlation is 0.775 (between V4 and V6).
The first row of the figure shows box plots of each variable, with the median and the 1st and 3rd quartiles. Because V1 has 3 levels, the box plots are drawn for each level separately to simplify interpretation. For example, for most of the variables, level 2 ("I") has a lower median.
This can also be seen in the first column of the figure, which shows histograms of each variable for each of the three levels of sex.
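The per-level medians read off the box plots can also be checked numerically. Here is a small sketch with base R's `aggregate()`; the six rows and the column names are made up for illustration and are not taken from the abalone file.

```r
# Made-up rows with the same three sex levels as the abalone data.
toy <- data.frame(
  sex    = factor(c("F", "F", "I", "I", "M", "M")),
  length = c(0.53, 0.55, 0.33, 0.43, 0.46, 0.35),
  height = c(0.135, 0.150, 0.080, 0.095, 0.095, 0.090)
)

# Median of each numeric variable within each level of sex.
medians <- aggregate(cbind(length, height) ~ sex, data = toy, FUN = median)
print(medians)
```

On the real data, `aggregate(. ~ V1, data = abalone_data, FUN = median)` would give one row of medians per level of V1.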
This dataset covers three types of wheat seeds: Kama, Rosa and Canadian (variable 8 gives the seed type). 70 kernels of each type were randomly selected for the experiment. To construct the data, seven geometric parameters of the wheat kernels were measured: area, perimeter, compactness, kernel length, kernel width, asymmetry coefficient, and kernel groove length (V1 to V7, in that order).
seeds_data <- read.table("seeds/seeds_dataset.txt")
seeds_data[,8] <- as.factor(seeds_data[,8])
print('Below is the size and structure of the data and type of each variable:\n')
## [1] "Below is the size and structure of the data and type of each variable:\n"
str(seeds_data)
## 'data.frame': 210 obs. of 8 variables:
## $ V1: num 15.3 14.9 14.3 13.8 16.1 ...
## $ V2: num 14.8 14.6 14.1 13.9 15 ...
## $ V3: num 0.871 0.881 0.905 0.895 0.903 ...
## $ V4: num 5.76 5.55 5.29 5.32 5.66 ...
## $ V5: num 3.31 3.33 3.34 3.38 3.56 ...
## $ V6: num 2.22 1.02 2.7 2.26 1.35 ...
## $ V7: num 5.22 4.96 4.83 4.8 5.17 ...
## $ V8: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
cat("Number of missing values: ", sum(is.na(seeds_data)),'\n')
## Number of missing values: 0
print('Below is the summary of key statistics of the data:\n')
## [1] "Below is the summary of key statistics of the data:\n"
summary(seeds_data)
## V1 V2 V3 V4
## Min. :10.59 Min. :12.41 Min. :0.8081 Min. :4.899
## 1st Qu.:12.27 1st Qu.:13.45 1st Qu.:0.8569 1st Qu.:5.262
## Median :14.36 Median :14.32 Median :0.8734 Median :5.524
## Mean :14.85 Mean :14.56 Mean :0.8710 Mean :5.629
## 3rd Qu.:17.30 3rd Qu.:15.71 3rd Qu.:0.8878 3rd Qu.:5.980
## Max. :21.18 Max. :17.25 Max. :0.9183 Max. :6.675
## V5 V6 V7 V8
## Min. :2.630 Min. :0.7651 Min. :4.519 1:70
## 1st Qu.:2.944 1st Qu.:2.5615 1st Qu.:5.045 2:70
## Median :3.237 Median :3.5990 Median :5.223 3:70
## Mean :3.259 Mean :3.7002 Mean :5.408
## 3rd Qu.:3.562 3rd Qu.:4.7687 3rd Qu.:5.877
## Max. :4.033 Max. :8.4560 Max. :6.550
cor(seeds_data[,1:7])
## V1 V2 V3 V4 V5 V6
## V1 1.0000000 0.9943409 0.6082884 0.9499854 0.9707706 -0.22957233
## V2 0.9943409 1.0000000 0.5292436 0.9724223 0.9448294 -0.21734037
## V3 0.6082884 0.5292436 1.0000000 0.3679151 0.7616345 -0.33147087
## V4 0.9499854 0.9724223 0.3679151 1.0000000 0.8604149 -0.17156243
## V5 0.9707706 0.9448294 0.7616345 0.8604149 1.0000000 -0.25803655
## V6 -0.2295723 -0.2173404 -0.3314709 -0.1715624 -0.2580365 1.00000000
## V7 0.8636927 0.8907839 0.2268248 0.9328061 0.7491315 -0.01107902
## V7
## V1 0.86369275
## V2 0.89078390
## V3 0.22682482
## V4 0.93280609
## V5 0.74913147
## V6 -0.01107902
## V7 1.00000000
Based on the correlation matrix, variables V1 and V2 have the highest correlation, at about 0.994.
library(caret)
# calculate the pre-process parameters from the dataset
seeds_scale_Params <- preProcess(seeds_data[,1:7], method= c('scale'))
# summarize transform parameters
print(seeds_scale_Params)
## Created from 210 samples and 7 variables
##
## Pre-processing:
## - ignored (0)
## - scaled (7)
# transform the dataset using the parameters
seeds_scaled <- predict(seeds_scale_Params, seeds_data[,1:7])
summary(seeds_scaled)
## V1 V2 V3 V4
## Min. :3.640 Min. : 9.503 Min. :34.20 Min. :11.06
## 1st Qu.:4.217 1st Qu.:10.299 1st Qu.:36.26 1st Qu.:11.88
## Median :4.933 Median :10.965 Median :36.96 Median :12.47
## Mean :5.103 Mean :11.148 Mean :36.86 Mean :12.70
## 3rd Qu.:5.947 3rd Qu.:12.033 3rd Qu.:37.57 3rd Qu.:13.50
## Max. :7.279 Max. :13.209 Max. :38.86 Max. :15.07
## V5 V6 V7
## Min. : 6.963 Min. :0.5089 Min. : 9.195
## 1st Qu.: 7.794 1st Qu.:1.7036 1st Qu.:10.265
## Median : 8.570 Median :2.3937 Median :10.627
## Mean : 8.627 Mean :2.4610 Mean :11.004
## 3rd Qu.: 9.430 3rd Qu.:3.1716 3rd Qu.:11.958
## Max. :10.677 Max. :5.6240 Max. :13.327
# load packages
library(caret)
# calculate the pre-process parameters from the dataset
seed_center_Params <- preProcess(seeds_data[,1:7], method= c('center'))
# summarize transform parameters
print(seed_center_Params)
## Created from 210 samples and 7 variables
##
## Pre-processing:
## - centered (7)
## - ignored (0)
# transform the dataset using the parameters
seed_centered <- predict(seed_center_Params, seeds_data[,1:7])
summary(seed_centered)
## V1 V2 V3 V4
## Min. :-4.2575 Min. :-2.1493 Min. :-0.062899 Min. :-0.7295
## 1st Qu.:-2.5775 1st Qu.:-1.1093 1st Qu.:-0.014099 1st Qu.:-0.3663
## Median :-0.4925 Median :-0.2393 Median : 0.002451 Median :-0.1050
## Mean : 0.0000 Mean : 0.0000 Mean : 0.000000 Mean : 0.0000
## 3rd Qu.: 2.4575 3rd Qu.: 1.1557 3rd Qu.: 0.016776 3rd Qu.: 0.3512
## Max. : 6.3325 Max. : 2.6907 Max. : 0.047301 Max. : 1.0465
## V5 V6 V7
## Min. :-0.6286 Min. :-2.9351 Min. :-0.8891
## 1st Qu.:-0.3146 1st Qu.:-1.1387 1st Qu.:-0.3631
## Median :-0.0216 Median :-0.1012 Median :-0.1851
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3031 3rd Qu.: 1.0685 3rd Qu.: 0.4689
## Max. : 0.7744 Max. : 4.7558 Max. : 1.1419
# load packages
library(caret)
# calculate the pre-process parameters from the dataset
seed_standardize_Params <- preProcess(seeds_data[,1:7], method= c('center','scale'))
# summarize transform parameters
print(seed_standardize_Params)
## Created from 210 samples and 7 variables
##
## Pre-processing:
## - centered (7)
## - ignored (0)
## - scaled (7)
# transform the dataset using the parameters
seed_standardized <- predict(seed_standardize_Params, seeds_data[,1:7])
summary(seed_standardized)
## V1 V2 V3 V4
## Min. :-1.4632 Min. :-1.6458 Min. :-2.6619 Min. :-1.6466
## 1st Qu.:-0.8858 1st Qu.:-0.8494 1st Qu.:-0.5967 1st Qu.:-0.8267
## Median :-0.1693 Median :-0.1832 Median : 0.1037 Median :-0.2371
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8446 3rd Qu.: 0.8850 3rd Qu.: 0.7100 3rd Qu.: 0.7927
## Max. : 2.1763 Max. : 2.0603 Max. : 2.0018 Max. : 2.3619
## V5 V6 V7
## Min. :-1.6642 Min. :-1.95210 Min. :-1.8090
## 1st Qu.:-0.8329 1st Qu.:-0.75734 1st Qu.:-0.7387
## Median :-0.0572 Median :-0.06731 Median :-0.3766
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.8026 3rd Qu.: 0.71068 3rd Qu.: 0.9541
## Max. : 2.0502 Max. : 3.16303 Max. : 2.3234
library(caret)
# calculate the pre-process parameters from the dataset
seed_normalize_Params <- preProcess(seeds_data[,1:7], method= c('range'))
# summarize transform parameters
print(seed_normalize_Params)
## Created from 210 samples and 7 variables
##
## Pre-processing:
## - ignored (0)
## - re-scaling to [0, 1] (7)
# transform the dataset using the parameters
seed_normalized <- predict(seed_normalize_Params, seeds_data[,1:7])
summary(seed_normalized)
## V1 V2 V3 V4
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1586 1st Qu.:0.2149 1st Qu.:0.4428 1st Qu.:0.2045
## Median :0.3555 Median :0.3946 Median :0.5930 Median :0.3516
## Mean :0.4020 Mean :0.4441 Mean :0.5708 Mean :0.4108
## 3rd Qu.:0.6341 3rd Qu.:0.6829 3rd Qu.:0.7230 3rd Qu.:0.6085
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## V5 V6 V7
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2238 1st Qu.:0.2336 1st Qu.:0.2590
## Median :0.4326 Median :0.3685 Median :0.3466
## Mean :0.4480 Mean :0.3816 Mean :0.4378
## 3rd Qu.:0.6641 3rd Qu.:0.5206 3rd Qu.:0.6686
## Max. :1.0000 Max. :1.0000 Max. :1.0000
The Box-Cox method can only be applied to strictly positive values, so we first need to check whether this dataset contains any non-positive values.
# load packages
library(mlbench)
library(caret)
# Check if there is non-positive number in the dataset
cat('The data has ',sum(sign(seeds_data[,1:7]) == -1) + sum(sign(seeds_data[,1:7]) == 0 ), ' non-positive number.\n')
## The data has 0 non-positive number.
# calculate the pre-process parameters from the dataset
seed_boxcox_Params <- preProcess(seeds_data[,1:7], method= c('BoxCox'))
# summarize transform parameters
print(seed_boxcox_Params)
## Created from 210 samples and 7 variables
##
## Pre-processing:
## - Box-Cox transformation (7)
## - ignored (0)
##
## Lambda estimates for Box-Cox transformation:
## -0.6, -2, 2, -2, 0.1, 0.6, -2
# transform the dataset using the parameters
seed_boxcox <- predict(seed_boxcox_Params, seeds_data[,1:7])
# summarize the transformed dataset
summary(seed_boxcox)
## V1 V2 V3 V4
## Min. :1.262 Min. :0.4968 Min. :-0.17349 Min. :0.4792
## 1st Qu.:1.296 1st Qu.:0.4972 1st Qu.:-0.13286 1st Qu.:0.4819
## Median :1.330 Median :0.4976 Median :-0.11854 Median :0.4836
## Mean :1.330 Mean :0.4976 Mean :-0.12040 Mean :0.4839
## 3rd Qu.:1.365 3rd Qu.:0.4980 3rd Qu.:-0.10593 3rd Qu.:0.4860
## Max. :1.400 Max. :0.4983 Max. :-0.07836 Max. :0.4888
## V5 V6 V7
## Min. :0.967 Min. :-0.2473 Min. :0.4755
## 1st Qu.:1.080 1st Qu.: 1.2638 1st Qu.:0.4804
## Median :1.175 Median : 1.9272 Median :0.4817
## Mean :1.175 Mean : 1.9116 Mean :0.4825
## 3rd Qu.:1.270 3rd Qu.: 2.5883 3rd Qu.:0.4855
## Max. :1.395 Max. : 4.3333 Max. :0.4883
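The transformed values above follow the usual Box-Cox formula, (x^lambda - 1) / lambda, with log(x) used when lambda is 0. A minimal base R sketch for a fixed lambda follows; the function name `box_cox` and the `eps` tolerance are my own, and the sketch does not estimate lambda.

```r
# Box-Cox transform of strictly positive values for a given lambda.
box_cox <- function(x, lambda, eps = 1e-8) {
  stopifnot(all(x > 0))
  if (abs(lambda) > eps) (x^lambda - 1) / lambda else log(x)
}

# Minimum of V3 (0.8081) with its printed lambda of 2.
print(box_cox(0.8081, lambda = 2))

# Minimum of V1 (10.59) with its printed lambda of -0.6.
print(box_cox(10.59, lambda = -0.6))
```

For V3 this is (0.8081^2 - 1) / 2, about -0.17349, which agrees with the minimum shown for V3 in the summary above. The V1 value comes out near 1.262, again in line with the summary.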
# load packages
library(mlbench)
library(caret)
# calculate the pre-process parameters from the dataset
seed_YeoJohnson_Params <- preProcess(seeds_data[,1:7], method= c('YeoJohnson'))
# summarize transform parameters
print(seed_YeoJohnson_Params)
## Created from 210 samples and 4 variables
##
## Pre-processing:
## - ignored (0)
## - Yeo-Johnson transformation (4)
##
## Lambda estimates for Yeo-Johnson transformation:
## -0.7, -2.28, -0.12, 0.46
# transform the dataset using the parameters
seed_YeoJohnson <- predict(seed_YeoJohnson_Params, seeds_data[,1:7])
# summarize the transformed dataset
summary(seed_YeoJohnson)
## V1 V2 V3 V4
## Min. :1.175 Min. :0.4367 Min. :0.8081 Min. :4.899
## 1st Qu.:1.198 1st Qu.:0.4369 1st Qu.:0.8569 1st Qu.:5.262
## Median :1.221 Median :0.4370 Median :0.8734 Median :5.524
## Mean :1.222 Mean :0.4370 Mean :0.8710 Mean :5.629
## 3rd Qu.:1.246 3rd Qu.:0.4372 3rd Qu.:0.8878 3rd Qu.:5.980
## Max. :1.270 Max. :0.4373 Max. :0.9183 Max. :6.675
## V5 V6 V7
## Min. :1.197 Min. :0.6487 Min. :4.519
## 1st Qu.:1.268 1st Qu.:1.7211 1st Qu.:5.045
## Median :1.329 Median :2.2053 Median :5.223
## Mean :1.329 Mean :2.1918 Mean :5.408
## 3rd Qu.:1.391 3rd Qu.:2.6844 3rd Qu.:5.877
## Max. :1.473 Max. :3.9179 Max. :6.550
# load packages
library(caret)
# calculate the pre-process parameters from the dataset
seed_PCA_Params <- preProcess(seeds_data[,1:7], method= c('center','scale','pca'))
# summarize transform parameters
print(seed_PCA_Params)
## Created from 210 samples and 7 variables
##
## Pre-processing:
## - centered (7)
## - ignored (0)
## - principal component signal extraction (7)
## - scaled (7)
##
## PCA needed 3 components to capture 95 percent of the variance
# transform the dataset using the parameters
seed_PCA <- predict(seed_PCA_Params, seeds_data[,1:7])
# summarize the transformed dataset
summary(seed_PCA)
## PC1 PC2 PC3
## Min. :-4.4650 Min. :-2.5342 Min. :-1.60539
## 1st Qu.:-1.8219 1st Qu.:-0.8087 1st Qu.:-0.58036
## Median : 0.3508 Median :-0.1274 Median :-0.04621
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 1.9078 3rd Qu.: 0.7861 3rd Qu.: 0.47846
## Max. : 3.2911 Max. : 2.7811 Max. : 2.91454
# install packages
# install.packages('fastICA')
# load packages
library(mlbench)
library(caret)
# calculate the pre-process parameters from the dataset
seed_ICA_Params <- preProcess(seeds_data[,1:7], method= c('center','scale','ica'),n.comp= 5)
# summarize transform parameters
print(seed_ICA_Params)
## Created from 210 samples and 7 variables
##
## Pre-processing:
## - centered (7)
## - independent component signal extraction (7)
## - ignored (0)
## - scaled (7)
##
## ICA used 5 components
# transform the dataset using the parameters
seed_ICA <- predict(seed_ICA_Params, seeds_data[,1:7])
# summarize the transformed dataset
summary(seed_ICA)
## ICA1 ICA2 ICA3 ICA4
## Min. :-3.37294 Min. :-1.8690 Min. :-2.26399 Min. :-2.0022
## 1st Qu.:-0.46181 1st Qu.:-0.9819 1st Qu.:-0.68141 1st Qu.:-0.8353
## Median : 0.06536 Median : 0.1982 Median : 0.03591 Median :-0.1338
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.63044 3rd Qu.: 0.8119 3rd Qu.: 0.68949 3rd Qu.: 0.8512
## Max. : 2.21662 Max. : 1.8839 Max. : 2.88302 Max. : 2.5194
## ICA5
## Min. :-3.303668
## 1st Qu.:-0.499120
## Median :-0.006891
## Mean : 0.000000
## 3rd Qu.: 0.583083
## Max. : 5.158864
To visualize the data, I used a ggpairs plot (from the GGally package), since it combines the distributions, scatter plots, and correlations in a single figure.
library(GGally)
ggpairs(seeds_data, columns = 1:8)
The above figure provides comprehensive information about the data.
The plots on the diagonal are the distributions of each variable.
Because V8 is a factor, it is shown as a bar plot. The frequency is the same for each level because there are 70 observations of each type of wheat seed.
The correlations shown above the diagonal again indicate that V1 and V2 are highly correlated.
The scatter plot of these two variables below the diagonal confirms this: the points fall almost on a straight line, which indicates a linear relationship between the two variables. Since they carry nearly the same information, it may be better to keep only one of them in the analysis.
V4 and V5 are also strongly correlated with V1 and V2 (all four of these correlations are at least 0.94).
V3 and V6 have the weakest correlations with the other variables, as their correlation values and scatter plots show.
The last column of the figure shows box plots of each variable, with the median and the 1st and 3rd quartiles. Because V8 has 3 levels, the box plots are drawn for each level separately to simplify interpretation. For example, for most of the variables, type 2 has a higher median.
This can also be seen in the bottom row of the figure, which shows histograms of each variable for each of the three types of wheat seed.