1. Introduction

This assignment was for a Data Mining class I took at Penn State. The assignment was a group project that we had three weeks to complete and present our results to the class. We were allowed to choose any dataset we wanted from the UCI repository. The requirements were to utilize logistic regression, linear, quadratic and regularized discriminant analysis as well as random forest, boosting and support vector machines. We were to compare methods and computing times and present results to the class.

Report Outline

  1. Project Description
  2. Variables of Interest
  3. Exploratory Data Analysis
  4. Statistical Analysis
  5. Recommendations

1. Project Description

Mining activity has long been associated with mining hazards, such as fires, floods, and toxic contaminants (Dozolme, P., 2016). Among these hazards, seismic hazards are the hardest to detect and predict (Sikora & Wrobel, 2010). Minimizing loss from seismic hazards requires advanced data collection and analysis. In recent years, more and more cutting-edge seismic and seismoacoustic monitoring systems have come about. Still, the disproportionate number of low-energy versus high-energy seismic phenomena (e.g. > \(10^4\)J) renders traditional analysis methods insufficient in making accurate predictions. To investigate these seismic hazards and explore more advance analysis technique we used the seismic-bumps data set provided by Sikora & Wrobel (2010), found in the UCI Machine Learning Repository. This seismic-bumps data set comes from a coal mine located in Poland and contains 2584 observations of 19 attributes.

2. Variables of Interest

Each observation summarizes seismic activity in the rock mass within one 8-hour shift. Note that the decision attribute, named “class”, has values 1 and 0. This variable is the response variable we use in this project. A class value of “1” is categorized as “hazardous state”, which essentially indicates a registered seismic bump with high energy (>\(10^4\)J) in the next shift. A class value “0” represents non-hazardous state in the next shift. Table 1 in the Appendix has a listing of all 18 variables with their descriptions.

Table I. Attribute information of the seismic-bumps dataset
Data Attributes Description Data Types
seismic result of shift seismic hazard assessment: ‘a’ - lack of hazard, ‘b’ - low hazard, ‘c’ - high hazard, ‘d’ - danger state Categorical
seismoacoustic result of shift seismic hazard assessment Categorical
shift type of a shift: ‘W’ - coal-getting, ‘N’ - preparation shift Categorical
genergy seismic energy recorded within previous shift by active geophones (GMax) monitoring the longwall Continuous
gpuls number of pulses recorded within previous shift by GMax Continuous
gdenergy deviation of recorded energy within previous shift from average energy recorded during eight previous shifts Continuous
gdpuls deviation of recorded pulses within previous shift from average number of pulses recorded during eight previous shifts Continuous
ghazard result of shift seismic hazard assessment by the seismoacoustic method based on registration coming from GMax Categorical
nbumps the number of seismic bumps recorded within previous shift Continuous
nbumps\(i\), \(i\in\{1,\ldots,9\}\) the number of seismic bumps (\(10^i-10^{i+1}\) J) registered within previous shift Continuous
energy total energy of seismic bumps registered within previous shift Continuous
maxenergy maximum energy of the seismic bumps registered within previous shift Continuous
class the decision attribute: ‘1’ - high energy seismic bump occurred in the next shift (‘hazardous state’), ‘0’ - no high energy seismic bumps occurred in th next shift (‘non-hazardous state’) Categorical

3. Exploratory Data Analysis

The response variable for the dataset is class. Observe the small number of hazardous responses, “1”, versus the large number of non-hazardous responses, “0”. It would be wise to reduce the dataset as much as possible to ensure that our analysis is as solid as possible.

Response Variable: Class
Class 0 1
Count 2414 170

First, we look at the relationship of the factor variables, seismic, seismoacoustic, shift, ghazard with the response variable class. We utilize the chi-square test to parse out any relationship between the reponse and the 4 factor variables. If there is not a relationship, we will assume the variables are indenpendent and remove them from the analysis. The results from the tests are in the Appendix, but tables are provided below. Results from the tests show that the variables seismic and shift have a relationship to the response variable. The other two do not provide any evidence to reject the Null Hypothesis and therefore will be removed for further analysis.

Next, we look at the variables nbumps1 \(\dots\) nbumps89. The variables nbumps89, nbumps7 and nbumps6 have no recorded data. These will be removed from further analysis. Below is a table for the remaining nbumps and their relationship to the repsonse variable class, e.g. the table shows 4 bumps being registered for 123 non-hazardous class and 20 hazardous class. The 5 Bumps variable is observed only 12 times in the entire data set. Since it is observed so little it will be revmoved from the analysis.

What goes bumps in the night
0 1
5 Bumps 11 1
4 Bumps 123 20
3 Bumps 442 55
2 Bumps 399 56
1 Bumps 567 31

Finally, we look at a correlation plot of the remaining variables to see if we can parse out any multicollinearity issues. The most glaring issue is between energy and max energy. The correlation is almost a perfect positive relationship. Energy captures the total amount of energy and max energy records the maximum energy produced by the event. Since it is almost 1 to 1, I will eliminate the maxenergy variable from the analysis. Other concerns are with the nbumps 1,2,3, gplus and genergy, and gdenergy and gplus. We will utilize regularization techniques to help deal with the remaining multicollinearity.

For the EDA:

  • Dataset was reduced from 19 variables to 11 predictor variables
  • Examining contigency tables we eliminated 2 variables - semiascoustic, ghazard
  • Examining the dataframe, we eliminated 4 variables - 8/9, 7,6,5, bumps
  • Examining the correlation plot, we eliminated 1 variable - maxenergy
  • Expect Shift and Seismic variables to be significant in the analysis
  • Bumps 1 to 4 look to be significant from the EDA
  • Need regularization techniques to deal with the highly correlated variables - nbumps 1,2,3, gplus and genergy, and gdenergy and gplus

Multicollinearity can be address by dimension reduction techniques such as PCA, step-wise regression, LASSO or ridge. In project 2, we utilized step-wise regression and LASSO to arrive at two candidate models. However, even with these dimension reduction techniques our models still performed poorly. Hopefully, to remedy this poor performance, we can utilize more advance techniques such as Boosting, Random Forest or Support Vector Machines. We only look at the model that was obtained through step-wise regression for logistic regression, LDA, QDA and RDA and use it as a benchmark to look at our advanced techniques.

Below is our model of eleven variables that will attempt to model the response variable class

\(class \sim \beta_0 + \beta_1\cdot seismic + \beta_2\cdot shift+ \beta_3\cdot genergy + \beta_4\cdot gdenergy + \beta_5\cdot gpuls + \beta_6\cdot gdpuls + \beta_7\cdot nbumps\) \(+ \beta_8\cdot nbumps2 + \beta_9\cdot nbumps3\) \(+ \beta_{10}\cdot nbumps4 + \beta_{11}\cdot energy + \epsilon, \quad \epsilon \sim N(0,\sigma^2\)

4. Statistical Analysis

In the 4.1 section, we report ROC curves, missclassification rates and computing time for logistic regression, LDA, QDA and RDA. The data set is divided into a train set and a test set. In section 4.2, we report the same (ROC curves, missclassification rates and computing time) for boosting, random forest and support vector machines. Effectiveness of a method will be determined by assessing the misclassification rate for each method, but we will also look at the receiver operating characteristic (ROC curve). In a ROC curve the true positive rate(TPR) is plotted in function of the false positive rate(FPR) for different cut-off points. Each point on the ROC curve represents a TPR/FPR pair corresponding to a particular decision threshold. The area under the curve (AUC) will be used to help judge the effectiveness of a method. Ideally, we would like an AUC value of 90% or higher to determine a “good fit” for the seismic data.

4.1 Logistic Regression, LDA, QDA, RDA

The first method we used was logistic regression. The full model was using all of the predictors in the training data set. Initially, we used a threshold probability of 0.5 to classify into state 0 or 1. This yields an overall error rate of 6.7% for the training data and 6.5% for the test data, with minimal improvement in sensitivity. The ROC curve for this model indicates that it is still not a great fit for the data. However, reducing the dimension of the full model through step-wise regression produces a 4% increase in AUC and a slight reduction in missclassification found in Table 1. Computing time in seconds is all provided in Table 1.

4.2 Linear Discriminant Analysis

It is reasonable to believe that each class is distributed normally with some different mean vector. Thus, we implemented an LDA approach. Using a classification threshold of 0.5 yields an overall error rate of 7.4% for the training data and 7.7% for the test data, but with much higher sensitivity than in the previous models. The group means suggest that a mining shift with a higher number of seismic bumps and associated higher released energy (measured in Joules) is correlated with hazard status of the mine in the subsequent shift. We performed linear discriminant analysis on the data after variable selection. For the reduced model we observed an error rate of 8.10% on the train data an error rate of 7.6% for the rest data. The AUC’s for the full model and step model are roughly the same as the logistic regression model. Computing time is available in Table 2.

2.3 Quadratic Discriminant Analysis

Quadratic discriminant analysis (QDA) provides an alternative approach to LDA in that QDA assumes each class has its own covariance matrix. This allows the method to be more flexible in some ways, although the singularity of the covariance matrices resulting from multicollinearity prohibited us from fitting a QDA model prior to variable selection. We observed an error rate of 15% for Training set and 16% for the Test set using the step-model. AUC’s for the Step model are lower when compared to the previous method. Computing times are available in table 3.

2.4 Regularized Discriminant Analysis

Similar to LDA, it is reasonable to believe that each class is distributed normally with some different mean vector and covariance matrix, as seen in quadratic discriminant analysis (QDA). RDA allows for a compromise between LDA and QDA. We implemented an RDA approach. Using a classification threshold of 0.5 yields an overall error rate of 7.6% for the training data and 8.2% for the test data. Similar results were obtained for the step model. The AUC’s are also observed as quite poor. Poor performance on the test data might be indicative of RDA over fitting for this model.

3 Boosting, Random Forest Classification, Support Vector Machines

3.1 Boosting

Next we performed boosting to our data set to see whether it brings improvement compared to previous methods. Boosting involves combining a large number of decision trees. In boosting, we slowly grow the tree according to residuals from the model. The construction of each tree depends strongly on the trees that have already been grown. (James et al., 2013)

We also did boosting on model 1 (from stepwise variable selection), but the original model performs best, when evaluating by misclassification rate (shown in the table below). Therefore, in the interest of space, we only report the original model’s ROC curves. The AUC for the train data is significantly better while the test data’s AUC is roughly equivalent to logistic regression and LDA.

Table 5 Boosting
model Full model Stepwise model
time 10.38 5.64
misclassification rate .057 .060

3.2 Random Forest Classification

Next, we use Random Forest classification method as it yields relatively better classification results among all tree-based methods. As opposed to growing a single decision tree (as in CART), random forest grows multiple trees, with having each split to consider only a subset of all predictors.Then it takes average of all trees to make final tree. In this way, random forest can reduce amount of potential correlation between trees and thereby help reduce the variance of the final tree. First, we used tuneRF function to find the optimal numbers of variables to try (mtry) splitting on at each node. We found mtry = 2 produces least out of the box (OBB) error, that means, 2 out of 15 predictors should be considered for each split.

Then, we applied Random Forest formula on both train and test data sets for the models derived before and after variable selection. In each cases, the number of tress we used is 1000. We also calculated the ‘variable importance’ in order to see relative importance of each variable in the classification process.

Random forest on full model

Here we performed random forest classification on training and test data sets individually, using mtry = 2 and ntree = 1000. We found slightly lower test missclassification rate (5.7%) than train’s (9.4%).

However, ROC curves show that predictions on test data set are less accurate than predictions on train data set. We also see from the Variable importance plot, that the most important variables are nbumps2, nbumps3, genergy, nbumps4, nbumps, maxenergy, gdenergy, gpuls, and energy. Variables like shift, ghazard, nbumps5 and seismoacoustic are of less important for predicting seismic events.

Random forest on stepwise model

Here we performed random forest classifications on resulting model from stepwise variable selection. Misclassification rates for stepwise model’s training and test data sets are 6.9% and 5.4%, respectively. This time the ROC curves revealed slightly improved test AUC. According to the variable importance plot, most important variables for predicting next seismic events seem to be nbumps and nbumps4. We can see from ROC curves, that train AUC is still slightly higher than test AUC.

At this point, it is clear that across all of the trees considered in the random forests so far, number of bumps in the previous shifts are by far the most important variables for predicting potential seismic events in future mining shift(s).

RF on Models Error Rate Important Variable(s)
Full Model (Train) 9.4% nbumps2,3,4 , genergy, nbumps, maxenergy, gdenergy, gpuls, and energy
Full Model (Test) 5.7% nbumps2,3,4 , genergy, nbumps, maxenergy, gdenergy, gpuls, and energy
Stepwise model (Train) 10% nbumps and nbumps4
Stepwise model (Test) 7.8% nbumps and nbumps4

|:————–|—————–:|——:| |Time elasped| Full Model: 5.09 |Stepwise Model: 1.90|

3.3 Support vector classifier and support vector machine

A support vector machine allows for the classification of data with, if desired, non-linear boundaries. Essentially, the idea is to project the data into a space of higher dimension, determine decision boundaries, and then project back into the original space. We attempted to fit this approach to our data, using a linear, radial, and polynomial kernel.

Fitting a model with a linear kernel is a support vector classifier with an \(n\times p\) data set, and can be computed by solving the optimization problem

\[\text{maximize}_{\beta_0,\beta_1,\ldots,\beta_p, \epsilon_1,\ldots,\epsilon_n} \quad M\]

\[\text{subject to } \sum_{j=1}^p\beta_j^2=1,\]

\[y_i(\beta_0 +\beta_1x_{i1}+\beta_2x_{i2}+\ldots + \beta_px_{ip})\geq M(1-\epsilon),\]

However, replacing the inner product with a different, non-linear, kernel function permits different, more sophisticated, implementations of the support vector machine.

With an increased cost, the time to fit the prescribed SVM increased. Thankfully, our data favored lower costs which meant quicker computation in the end. For each kernel, we searched over a list of possible inputs for cost, gamma (where applicable), and degree (where applicable). Our final results are presented below.

Stepwise Model
Kernel linear radial polynomial
cost .001 .001 .001
gamma .2 1 .2
degree N/A N/A 2
time 5.4 26.47 25.31
misclassification rate .06 .06 .06

SVM missclassification rates are identical across the three kernels. However, the AUC’s from the ROC curve are quite poor, it would be better to randomly guess than use the SVM method.

4 Conclusions and future work

It is clear that variable selection in general improves the ability to predict hazardous events in the mine, according to our data. However, the improvements afforded are often minimal. In addition, while dimension reduction tends to dramatically improve sensitivity, this is often done at the cost of increasing overall error rate. Using the summaries of our findings, we recommend still using logistic regression to predict seismic events. LDA is a close second. However, random forest classification or boosting with a better understanding of its “fine-tuning” might achieve better results. Especially if more data was available.

For our final project using cluster analysis we will continue our exploration of seismic data. The Symposium on Advances in Artificial Intelligence and Applications had a competition in 2016 using a seismic data set produced from the same authors as this current data set. The original data set involves over 70,000 observations, but we will investigate a training set put forth in the competition that only has 13,000 observations.

Appendix