The frequency distribution of genres in the music data.
Using the information above, the number of samples (12495) is much larger than the number of predictors (195). It is therefore feasible to split the dataset into training and test sets, which allows both tuning-parameter selection and evaluation of model performance on held-out data.
Given the imbalance in the class distribution of the response variable (classical has the highest percentage and metal the lowest), stratified random sampling should be used to split the dataset so that each class is represented proportionally in both sets.
Also, because of the large sample size, resampling or cross-validation techniques can be used to estimate model performance. K-fold cross-validation with k = 5 or 10 would be a reasonable choice and is less computationally expensive than alternatives such as repeated cross-validation or leave-one-out.
The createDataPartition function in the caret package can be used to split the dataset into training and test sets, as sketched below.
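A minimal sketch, assuming the genre labels are stored in a factor called genre within a data frame called music (both names are placeholders, since the exercise does not name the objects):

```r
library(caret)

set.seed(100)                                   # for reproducibility
# Stratified 80/20 split: sampling is done within each genre,
# so the class proportions are approximately preserved in both sets
inTrain  <- createDataPartition(music$genre, p = 0.8, list = FALSE)
training <- music[ inTrain, ]
testing  <- music[-inTrain, ]
```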
The createFolds function from the same package can then be used to divide the training set into 5 or 10 folds while approximately preserving the class distribution.
In both cases, a seed should be set for reproducibility.
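A sketch of creating the stratified folds, again assuming the training-set labels are in training$genre (a placeholder name); createFolds samples within each class, so every fold has roughly the original genre proportions:

```r
set.seed(101)                                   # for reproducibility
# 10 stratified folds of the training set; returnTrain = TRUE returns the
# row indices used for model fitting in each resample
cvFolds <- createFolds(training$genre, k = 10, returnTrain = TRUE)
```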
The frequency distribution of permeability value in the dataset.
Using the information above, the number of samples (165) is much smaller than the number of predictors (1107). Because the sample size is small, splitting the data into training and test sets is not advisable: the resulting training set might be too small to adequately capture the relationship between the predictors and the response. In this case, resampling techniques should be used to select tuning parameters and estimate performance.
From the figure above, the distribution of the permeability values is skewed. Given this imbalance in the response, stratified random sampling is suggested when creating the resampling data sets.
The createDataPartition family of functions in the caret package performs stratified sampling; for a continuous response, the sampling is stratified on percentiles of the response, so its distribution is approximately maintained. Repeated 5-fold cross-validation (multiple iterations of 5 folds) can be created with createMultiFolds, as sketched below.
A seed should be set for reproducibility.
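A minimal sketch, assuming the data are loaded from the AppliedPredictiveModeling package as in the exercise; the number of repeats (25) and the seed value are arbitrary choices for illustration:

```r
library(AppliedPredictiveModeling)
library(caret)
data(permeability)   # loads fingerprints (predictors) and permeability (response)

set.seed(72)         # seed value is arbitrary; set for reproducibility
# Repeated, stratified 5-fold cross-validation: for a numeric response,
# createMultiFolds stratifies the folds on percentiles of the response
repeatedFolds <- createMultiFolds(permeability[, 1], k = 5, times = 25)
```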
library(AppliedPredictiveModeling)   # provides the ChemicalManufacturingProcess data
data(ChemicalManufacturingProcess)
From the table above, the best setting is 4 PLS components (R^2 = 0.545, standard error = 0.0308). Applying the one-standard-error rule, the boundaries are 0.545 - 0.0308 = 0.5142 and 0.545 + 0.0308 = 0.5758.
The setting with 3 PLS components (R^2 = 0.533) is above the lower boundary (0.5142); hence, a model with 3 PLS components is the most parsimonious (simplest) model whose performance is within one standard error of the numerically best setting.
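A short sketch of the one-standard-error computation, with the resampled R^2 means and standard errors taken from the table:

```r
# Resampled R^2 summaries for 1-10 PLS components (values from the table)
Rsq_mean <- c(0.444, 0.500, 0.533, 0.545, 0.542, 0.537, 0.534, 0.534, 0.520, 0.507)
Rsq_se   <- c(0.0272, 0.0298, 0.0302, 0.0308, 0.0322, 0.0327, 0.0333, 0.0330, 0.0326, 0.0324)

best  <- which.max(Rsq_mean)              # 4 components
lower <- Rsq_mean[best] - Rsq_se[best]    # 0.545 - 0.0308 = 0.5142
min(which(Rsq_mean >= lower))             # 3 -> most parsimonious setting
```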
# Standard errors and means of the resampled R^2 for 1-10 PLS components
error = c(0.0272, 0.0298, 0.0302, 0.0308, 0.0322, 0.0327, 0.0333, 0.0330, 0.0326, 0.0324)
mean = c(0.444, 0.500, 0.533, 0.545, 0.542, 0.537, 0.534, 0.534, 0.520, 0.507)
# Tolerance: proportional loss in R^2 relative to the best value (0.545, at 4 components)
toler = round((mean - 0.545) / 0.545, 4)
Computed tolerance values
If a 10% loss in R^2 is acceptable, then the optimal number of PLS components is 2 (a tolerance of about -8.3%, compared with -18.5% for 1 component).
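A quick check using the toler vector computed above:

```r
# Smallest number of PLS components whose loss in R^2 is within 10% of the best
min(which(toler > -0.10))   # 2
```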
From the figure above, the random forest has the highest R^2 value, although the R^2 for the SVM is relatively close to that of the random forest, with some overlap. Thus, the best models in terms of optimal R^2 are the random forest and the support vector machine.
Given each model's prediction time, model complexity, and R^2 estimate, the SVM should be chosen: it is fairly fast and its R^2 is relatively close to the best value. However, this decision is somewhat subjective; the PLS and regression tree models could also be considered if the prediction function needs to be recoded, although they give substantially lower R^2 values.
library(caret)     # provides the oil data (oilType, fattyAcids) and createDataPartition()
library(lattice)   # barchart()
data(oil)
# Class proportions in the full set of 96 samples
tb = round(table(oilType) / 96, 2)
barchart(tb, horizontal = F, main = 'Percentage distribution in original samples')
dist = as.data.frame(round(table(oilType) / 96, 2))
sampNum = 60   # size of each random sample (the intended training-set size)
set.seed(23123)
# Draw 30 simple random samples of 60 and record the class proportions in each
list_table = vector(mode = "list", length = 30)
for(i in 1:length(list_table))
  list_table[[i]] = round(table(sample(oilType, size = sampNum)) / 60, 2)
barchart(list_table[[1]], horizontal = F, main = 'Percentage distribution in random samples - 1')
barchart(list_table[[2]], horizontal = F, main = 'Percentage distribution in random samples - 2')
The class frequencies in a random sample differ from those of the original samples. Thirty different random samples of 60 were examined further, and their frequency distributions also differ from the original sample. In some instances the frequency of class 'G' is zero, so the training set would not capture all the classes, which could make modeling ineffective.
set.seed(234901)
# 30 stratified splits of roughly 60 samples each (p = 0.59 of 96)
list_caret = createDataPartition(oilType, p = 0.59, times = 30)
# Class proportions within each stratified sample
perc_caret = lapply(list_caret, function(x, y) round(table(y[x])/60, 2), y = oilType)
barchart(perc_caret[[1]], horizontal = F, main = 'Distribution using createDataPartition - 1')
barchart(perc_caret[[2]], horizontal = F, main = 'Distribution using createDataPartition - 2')
The createDataPartition function generates random samples whose frequency distributions are much closer to that of the original data. Because it samples within each class, it tends to preserve the original class proportions, and it also tends to include every class in the selected sample, which simple random sampling does not guarantee.
In any case, when the sample size is small it may be inefficient to partition the data into training and test sets, because the training set might not be large enough to capture the full range of the predictors.
Hence, leave-one-out cross-validation (LOOCV) would be a reasonable option for estimating model performance.
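A minimal sketch of LOOCV with caret's train(); the choice of model here (linear discriminant analysis) is only an illustration, not part of the exercise:

```r
ctrl <- trainControl(method = "LOOCV")
set.seed(345)
# Fit a classifier to the fatty acid predictors, estimating accuracy by
# leaving out one sample at a time
ldaFit <- train(x = fattyAcids, y = oilType,
                method = "lda",
                trControl = ctrl)
```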
# Scenarios: vary the test-set size at a fixed accuracy, then vary accuracy at n = 20
sample_size = c(10, 15, 20, 25, 30, 20, 20, 20, 20, 20)
accuracy = c(0.9, 0.9, 0.9, 0.9, 0.9, 0.75, 0.80, 0.85, 0.9, 0.95)
# Exact (Clopper-Pearson) 95% confidence interval for the first scenario
bin1 = binom.test(round(accuracy[1]*sample_size[1]), sample_size[1])
dt = t(as.data.frame(round(bin1$conf.int, 3)))
# Repeat the exact binomial confidence interval for the remaining scenarios
for(i in 2:10)
{
bin = binom.test(round(accuracy[i]*sample_size[i]), sample_size[i])
new_tb = t(as.data.frame(round(bin$conf.int, 3)))
dt = rbind(dt, new_tb)
}
rownames(dt) = NULL
colnames(dt) = c('lower_bound', 'upper_bound')
dt1 = data.frame(sample_size, accuracy)
dt2 = cbind(dt1, dt)
dt2$width = dt2$upper_bound - dt2$lower_bound
cat("Table of width using diffrent sample size and accuracy")
## Table of width using diffrent sample size and accuracy
library(knitr)   # kable() for the formatted table
kable(dt2)
sample_size | accuracy | lower_bound | upper_bound | width |
---|---|---|---|---|
10 | 0.90 | 0.555 | 0.997 | 0.442 |
15 | 0.90 | 0.681 | 0.998 | 0.317 |
20 | 0.90 | 0.683 | 0.988 | 0.305 |
25 | 0.90 | 0.688 | 0.975 | 0.287 |
30 | 0.90 | 0.735 | 0.979 | 0.244 |
20 | 0.75 | 0.509 | 0.913 | 0.404 |
20 | 0.80 | 0.563 | 0.943 | 0.380 |
20 | 0.85 | 0.621 | 0.968 | 0.347 |
20 | 0.90 | 0.683 | 0.988 | 0.305 |
20 | 0.95 | 0.751 | 0.999 | 0.248 |
From the table above, the width of the confidence interval decreases as the sample size increases, and it also decreases as the accuracy increases. Hence, if accuracy cannot be improved, a larger test set gives a more precise (less uncertain) estimate of performance; likewise, if the sample size cannot be increased, higher accuracy yields a narrower interval.
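As an illustrative check of this trend, the normal-approximation width of a 95% interval is about 2 * 1.96 * sqrt(p(1 - p)/n), so the width shrinks with larger n and with accuracy p closer to 1 (binom.test above uses the exact Clopper-Pearson interval, so the exact numbers differ slightly):

```r
# Approximate 95% CI width under the normal approximation
approx_width <- function(p, n) round(2 * 1.96 * sqrt(p * (1 - p) / n), 3)
approx_width(0.90, c(10, 20, 30))       # width shrinks as the sample size grows
approx_width(c(0.75, 0.90, 0.95), 20)   # width shrinks as accuracy approaches 1
```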