Introduction

We found the Forest Cover Type dataset in the UCI Machine Learning Repository. It contains forestry data from four wilderness areas in Roosevelt National Forest in northern Colorado. The observations are taken from 30m by 30m patches of forest, each classified as one of seven cover types:

  1. Spruce/Fir
  2. Lodgepole Pine
  3. Ponderosa Pine
  4. Cottonwood/Willow
  5. Aspen
  6. Douglas-fir
  7. Krummholz


The actual forest cover type for a given observation was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Kaggle hosted the dataset in a competition with a training set of 15,120 observations and a test set of 565,892 observations. The relative sizes of the train and test sets make classification of cover type a challenging problem. We decided to use the machine learning and visualization packages available in R for this project.

Data Exploration

Name | Measurement | Description
---- | ----------- | -----------
Elevation | meters | Elevation in meters
Aspect | azimuth | Aspect in degrees azimuth
Slope | degrees | Slope in degrees
Horizontal Distance To Hydrology | meters | Horizontal distance to the nearest surface water feature
Vertical Distance To Hydrology | meters | Vertical distance to the nearest surface water feature
Horizontal Distance To Roadways | meters | Horizontal distance to the nearest roadway
Hillshade 9am | 0 to 255 index | Hillshade index at 9am, summer solstice
Hillshade Noon | 0 to 255 index | Hillshade index at noon, summer solstice
Hillshade 3pm | 0 to 255 index | Hillshade index at 3pm, summer solstice
Horizontal Distance To Fire Points | meters | Horizontal distance to the nearest wildfire ignition point
Wilderness Area (4 binary columns) | 0 (absence) or 1 (presence) | Wilderness area designation
Soil Type (40 binary columns) | 0 (absence) or 1 (presence) | Soil type designation
Cover Type | Classes 1 to 7 | Forest cover type designation (response variable)
The Kaggle training data was not a simple random sample of the entire dataset, but a stratified sample of the seven forest classes. The training data was equally distributed among the seven classes.

Some class separation is clearly visible in the following plots of elevation.

Across wilderness areas, there also seems to be some class separation. Cottonwood/Willow covers are found only in the Cache la Poudre area, and the Neota area comprises only Spruce/Fir, Krummholz, and Lodgepole Pine covers.

Within the training data, the presence of certain soil types reduces the probability of observing certain cover types. Soil types 7 and 15 had no observations in the training data, even though they are present in the test set. For the purposes of modeling, we had to ignore these features.

Another compelling variable is aspect, the compass direction in which the slope has its steepest downward gradient. For example, in the rose diagram below, there is a greater prevalence of cottonwood/willow trees on patches where the slope faces east.

When we look at the aspect by wilderness area, we find aspect to be equally distributed among the four wilderness locations.

Let us look at the correlation plots of the ten quantitative variables.

One of the clearer relationships in the dataset was the distribution of Hillshade luminance. Hillshade is measured on a scale from 0 to 255 (dark to bright).

Hillshade at time t varies in proportion to: \[\cos(\text{slope})\cos(90^\circ - \text{Altitude}) + \sin(\text{slope})\sin(90^\circ - \text{Altitude})\cos(\text{Azimuth}-\text{Aspect})\] where Altitude is the angle of the Sun above the horizon and Azimuth is the compass direction of the Sun, measured clockwise from north (an azimuth of 90 degrees corresponds to east).

This equation arises from a theorem in spherical geometry known as the spherical law of cosines, which relates the sides and angles of triangles drawn on the surface of a sphere.


On a unit sphere, the lengths a, b, c of the sides of a spherical triangle correspond to the angles those sides subtend at the center of the sphere. If we know two sides a, b and the angle C between them, then the cosine of c is given by: \[\cos(c) = \cos(a)\cos(b) + \sin(a)\sin(b)\cos(C)\]

In short, the illumination of a patch is related to the altitude of the Sun, the slope of the terrain, and the relative direction of the Sun and the slope.
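
To make this concrete, here is a small R sketch of the hillshade calculation (our own illustration, not part of the original analysis). Angles are assumed to be supplied in degrees, the result is scaled to the dataset's 0-255 range, and the example sun position is purely illustrative:

# Hillshade index (0-255) for a patch, following the formula above.
# slope, aspect: terrain angles in degrees (aspect measured clockwise from north)
# sun_altitude, sun_azimuth: position of the Sun in degrees
hillshade <- function(slope, aspect, sun_altitude, sun_azimuth) {
  deg2rad <- function(d) d * pi / 180
  zenith  <- deg2rad(90 - sun_altitude)   # angle of the Sun from the vertical
  slope_r <- deg2rad(slope)
  illum   <- cos(slope_r) * cos(zenith) +
             sin(slope_r) * sin(zenith) * cos(deg2rad(sun_azimuth - aspect))
  round(255 * pmax(illum, 0))             # fully shadowed patches are clipped to 0
}

# Example: a 20-degree slope facing east (aspect = 90) with the Sun low in the east
hillshade(slope = 20, aspect = 90, sun_altitude = 35, sun_azimuth = 90)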

\(Hillshade(t_{9AM},t_{Noon},t_{3PM})\) has been plotted below in three dimensions.


More importantly, in the context of class separation, the following plot of elevation, slope, and hillshade clearly separates the forest cover types, suggesting that elevation may be the most significant factor in determining cover type.

The primary reason for collecting cartographic data of this kind is terrain mapping. It is ultimately useful for applying topographic correction to satellite images in remote sensing, or as background information in scientific studies.

Topographic correction is necessary if, for example, we wish to identify materials on the Earth’s surface by deriving empirical spectral signatures, or to compare images taken at different times with different Sun and satellite positions and angles. By applying the corrections, it is possible to transform satellite-derived reflectance into true reflectivity or radiance under horizontal conditions.

Modeling

Feature Engineering

Some of the features ultimately incorporated into our analysis are shown below.

New Feature | Description
----------- | -----------
Hillshade_mean | Mean of the hillshade values at 9AM, Noon, and 3PM (0-255)
Euclidean_Distance_To_Hydrology | Square root of the sum of the squared horizontal and vertical distances to water
aspect_group | Grouping of aspect into bins
log_elevation | Logarithm of elevation
Hillshade_9am_sq | 9AM hillshade squared
Hillshade_noon_sq | Noon hillshade squared
Hillshade_3pm_sq | 3PM hillshade squared
cosine_slope | Cosine of the slope, used to partially model the relationships between the hillshade variables
aspect_group_class | Whether the aspect is large or small
soil_family | The soil family a soil type belongs to
soil_rock_type | Whether the soil is stony, rubbly, or neither
interaction_9amnoon | Product of the hillshades at 9AM and Noon
interaction_noon3pm | Product of the hillshades at Noon and 3PM
interaction_9am3pm | Product of the hillshades at 9AM and 3PM
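
For illustration, a few of these features can be constructed with a handful of dplyr transformations (a sketch assuming the training data sits in a data frame train with the original Kaggle column names):

library(dplyr)

train <- train %>%
  mutate(
    # Mean and pairwise products of the three hillshade readings
    Hillshade_mean      = (Hillshade_9am + Hillshade_Noon + Hillshade_3pm) / 3,
    interaction_9amnoon = Hillshade_9am * Hillshade_Noon,
    interaction_noon3pm = Hillshade_Noon * Hillshade_3pm,
    interaction_9am3pm  = Hillshade_9am * Hillshade_3pm,
    # Straight-line distance to the nearest surface water feature
    Euclidean_Distance_To_Hydrology = sqrt(Horizontal_Distance_To_Hydrology^2 +
                                           Vertical_Distance_To_Hydrology^2),
    # Simple transformations of elevation and slope
    log_elevation = log(Elevation),
    cosine_slope  = cos(Slope * pi / 180))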

We also looked at linear combinations of many of the more highly correlated variables. This proved useful for improving the accuracy of one of our models.

Logistic Regression

We first investigated multinomial logistic regression. Since classical multiple regression is sensitive to high variance in coefficient estimates when predictor variables are correlated, we decided to use the glmnet package, which fits regularized logistic regressions. Given a predictor vector x, the probability that the response belongs to class k (of 1 through 7) is:

\[\mbox{Pr}(G=k|X=x)=\frac{e^{\beta_{0k}+\beta_k^Tx}}{\sum_{\ell=1}^7e^{\beta_{0\ell}+\beta_\ell^Tx}}\]

We then plug this into the elastic-net penalized negative log-likelihood, in which the first term is the negative multinomial log-likelihood:

\[\ell(\{\beta_{0k},\beta_{k}\}_1^7) = -\left[\frac{1}{N} \sum_{i=1}^N \Big(\sum_{k=1}^7y_{ik} (\beta_{0k} + x_i^T \beta_k)- \log \big(\sum_{k=1}^7 e^{\beta_{0k}+x_i^T \beta_k}\big)\Big)\right] +\lambda \left[ (1-\alpha)||\beta||_F^2/2 + \alpha\sum_{j=1}^p||\beta_j||\right]\]

where k indexes the response class (cover type 1 to 7) and N is the number of observations. The second term of the equation is the regularization penalty.

The elastic-net mixing parameter \(\alpha\) varies from 0 to 1. When \(\alpha = 0\), the penalty reduces to ridge regularization; when \(\alpha = 1\), lasso regularization is used. The ridge penalty shrinks the coefficients of closely correlated predictors toward each other, whereas the lasso tends to pick one of them and discard the others.

Ridge Regularization

We performed 10-fold cross-validation using ridge regularization over a grid of 100 values of \(\lambda\), choosing the \(\lambda\) that minimized cross-validation error for each model. When we ran the regression with a 70-30 cross-validation split of the training set, we got an accuracy of 55.70% on the Kaggle test set.

When we reran the ridge regression with an 85-15 split of the training data, we got a Kaggle accuracy of 59.56%. Since the test set is so much larger than our training set, it was beneficial to use a larger proportion of the training data for learning before running the model on the much larger test data.
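
A minimal sketch of the ridge fit with glmnet (assuming the predictors are in a numeric matrix x_train, the cover types in a factor y_train, and the test predictors in x_test; the object names are ours):

library(glmnet)

# 10-fold cross-validated multinomial ridge regression (alpha = 0)
cv_ridge <- cv.glmnet(x_train, y_train,
                      family  = "multinomial",
                      alpha   = 0,      # ridge penalty
                      nlambda = 100,    # grid of 100 lambda values
                      nfolds  = 10)

# Predict cover types at the lambda that minimized cross-validation error
pred <- predict(cv_ridge, newx = x_test, s = "lambda.min", type = "class")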

Lasso Regularization

A 10-fold cross-validated lasso-regularized logistic regression gave a Kaggle score of 59.59% with a 70-30 split and 59.53% with an 85-15 split, using the best \(\lambda\) in each case. With an 85-15 split, taking the lowest 50th percentile of \(\lambda\) values and using the mode of their predictions gave a Kaggle score of 59.44% (a sketch of this scheme follows below).
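
A sketch of that voting scheme (our own reconstruction, interpreting "lowest 50th percentile" as the smaller half of the lambda sequence, and assuming a fitted cv.glmnet object cv_lasso and a test matrix x_test):

# The smaller half of the cross-validated lambda path
small_lambdas <- cv_lasso$lambda[cv_lasso$lambda <= median(cv_lasso$lambda)]

# One column of class predictions per lambda value
preds <- predict(cv_lasso, newx = x_test, s = small_lambdas, type = "class")

# Majority vote (mode) across lambdas for each test observation
final_pred <- apply(preds, 1, function(row) names(sort(table(row), decreasing = TRUE))[1])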

Elastic-net Regularization

Using both the ridge and lasso regularization terms concurrently produces an elastic-net regularization. We ran this hybrid method with both a 70-30 split and 85-15 split in the training set. When running these models on the testing set, we got accuracies on Kaggle of 59.59% and 59.52%, respectively.

We also tried an elastic-net regression using all 15,120 observations in the training set with an \(\alpha\) parameter of 0.5. It earned an accuracy score on the testing set of 59.50%.

In summary, we attained a best accuracy of 59.59% using lasso and elastic-net regularization. The value of \(\alpha\) does not have a large impact on model accuracy for this dataset.

Artificial Neural Networks

We first modeled the data in a feed-forward neural network with a single layer of nodes. Later, we created deep learning models where we added more layers of hidden nodes.

Single Hidden Layer Using the nnet Package

The nnet R library allows the construction of feed-forward neural networks with a single hidden layer. After scaling our variables and running a basic model, we decided to tune the nnet function over the following parameters:

  • Number of hidden nodes
  • Number of iterations
  • Maximum number of weights (since this depends on the number of nodes)

We experimented to find the optimal number of nodes for our hidden layer by checking accuracy over various numbers of nodes.

We finally chose a 54-100-7 neural network on our training data, achieving 52.11% accuracy with 300 maximum iterations and 52.43% accuracy with 500 maximum iterations.
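
A sketch of that nnet fit (assuming the scaled predictors and the factor response Cover_Type are in a data frame train_scaled, with a similarly scaled test frame test_scaled; MaxNWts only needs to exceed the roughly 6,200 weights implied by a 54-100-7 network):

library(nnet)

set.seed(1)
fit_nnet <- nnet(Cover_Type ~ ., data = train_scaled,
                 size    = 100,     # 100 hidden nodes (54-100-7 architecture)
                 maxit   = 500,     # maximum number of iterations
                 MaxNWts = 10000,   # allow enough weights for a network this size
                 trace   = FALSE)

pred_nnet <- predict(fit_nnet, newdata = test_scaled, type = "class")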

Deep Learning using H2o

For adding multiple hidden layers, we used the deeplearning function from the H2o package. H2o is an R package that runs parallel, distributed machine learning algorithms to improve computational speed. Later, we also used H2o’s Gradient Boosting Machine (GBM) algorithm to achieve our second-best accuracy rate.

Some features of H2o include:

  • In-memory compression techniques to handle large datasets
  • Automatic standardization
  • Randomized Grid Search
  • Fast algorithms deployed in Java
  • Anomaly detection to identify outliers

When we first ran a basic model, our validation results were poor, so we decided to experiment with the default parameters using a grid search. We tried variations of one to three hidden layers with between 30 and 200 nodes each. We also varied the input dropout ratio, the fraction of training features to omit in order to improve generalizability. The next parameter was the learning rate, which controls how much the network weights adjust at each update. Finally, we tuned the rate-annealing parameter, which reduces the learning rate over time and lowers the likelihood that the optimization gets stuck around local minima. H2o allows for “hyper-parameter” tuning in the following format:

hyper_params <- list(
  hidden=list(c(30),c(50),c(70),c(100),c(130),c(160),c(200),c(250),c(30,60),c(60,90),c(90,120),c(120,150),c(30,60,90),c(60,90,120),c(90,120,160)), 
  input_dropout_ratio=c(0,0.05),
  rate=c(0.01,0.02),
  rate_annealing=c(1e-8,1e-7,1e-6))
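
These hyper-parameters are then passed to h2o.grid, roughly as follows (a sketch with assumed names: train_h2o and valid_h2o for the H2o frames, predictors for the predictor columns, and "Cover_Type" for the response; the number of epochs is illustrative):

library(h2o)
h2o.init()

dl_grid <- h2o.grid(
  algorithm        = "deeplearning",
  grid_id          = "dl_grid",
  x                = predictors,
  y                = "Cover_Type",
  training_frame   = train_h2o,
  validation_frame = valid_h2o,
  hyper_params     = hyper_params,
  adaptive_rate    = FALSE,   # so that rate and rate_annealing are actually used
  epochs           = 20)

# Rank the grid models by validation log loss
h2o.getGrid("dl_grid", sort_by = "logloss", decreasing = FALSE)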

From our first tuning run, the best neural network had an accuracy of 60.19% with a 54-120-150-7 architecture.

After returning to the model, we experimented with a wider range of tuning parameters. Our best model was a 54-500-800-7 network with a learning rate of 0.02, rate annealing of 1e-05, and an input dropout ratio of 0, achieving an accuracy of 67.4%.
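
That best configuration corresponds roughly to the following call (same assumed names as above; the number of epochs is illustrative):

best_dl <- h2o.deeplearning(
  x                   = predictors,
  y                   = "Cover_Type",
  training_frame      = train_h2o,
  hidden              = c(500, 800),   # 54-500-800-7 architecture
  adaptive_rate       = FALSE,
  rate                = 0.02,
  rate_annealing      = 1e-5,
  input_dropout_ratio = 0,
  epochs              = 50)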

Notable studies from the past:

The earliest use of the dataset was in a doctoral dissertation. Jock Blackard and Denis Dean at Colorado State University used linear discriminant analysis to obtain 58% accuracy and an artificial neural network to achieve 70% accuracy. The paper concludes that although ANNs have their drawbacks, they are still more viable than traditional methods, by which we think the authors refer to logistic regression, LDA, and support vector classifiers.

In 2007, Jain & Minz (Journal of Indian Agricultural Statistics) applied rough set theory as an alternative to machine learning approaches, achieving equally good results with fewer attributes (only 29) and 78% accuracy on the forest cover dataset. This is especially relevant to real-world classification applications in remote sensing that involve large datasets.

In 2004, in a publication at the IEEE International Conference on Systems, Man and Cybernetics, Xiaomei Liu and Kevin W. Bowyer of the University of Notre Dame used the forest cover dataset to make an interesting conjecture. In their own words:

Based on these findings, we decided to move on from neural networks in favor of tree-based methods.

Tree Based Methods

Random Forest

Our initial submission of the random forest algorithm, without tuning any parameters, improved our score to 68%.

The tuning was performed by varying the number of variables randomly sampled as candidates at each split (the mtry parameter).

Elevation clearly shows up as the most significant variable, both in terms of mean decrease in accuracy and in terms of its contribution to the reduction in node impurity.

After tuning the number of trees, we arrived at an accuracy of 75.17%, our first benchmark.
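
A sketch of this fit with the randomForest package (assuming a data frame train whose response column Cover_Type is a factor):

library(randomForest)

set.seed(1)
fit_rf <- randomForest(Cover_Type ~ ., data = train,
                       ntree = 500,        # number of trees
                       mtry  = 13,         # variables tried at each split
                       importance = TRUE)  # record variable importance

# Mean decrease in accuracy and in node impurity (Gini) for each variable
varImpPlot(fit_rf)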

After tuning on the feature-engineered dataset, we were only able to achieve around 71.93% accuracy.

Boosting Using XGBoost

The following steps were followed while implementing the XGBoost algorithm:

  • An untuned model performed slightly better than the untuned random forest, with 70% accuracy.
  • The tuning metric we used was the multinomial log loss:
\[multinomial\ logloss = -\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^My_{i,j}\log(p_{i,j})\]

where N is the number of observations, M is the number of class labels, log is the natural logarithm, \(y_{i,j}\) is 1 if observation i is in class j and 0 otherwise, and \(p_{i,j}\) is the predicted probability that observation i is in class j.

  • After running a 3-fold cross-validation with 800 rounds, we estimated that the log loss was minimized at around round 104 (a sketch of this step follows below).
  • Our new model used these parameters and improved the accuracy to 73.44%.
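
A sketch of that cross-validation step with the xgboost package (assuming a numeric feature matrix x_train and integer class labels y_train coded 0-6):

library(xgboost)

dtrain <- xgb.DMatrix(data = x_train, label = y_train)

params <- list(objective   = "multi:softmax",
               num_class   = 7,
               eval_metric = "mlogloss")

# 3-fold cross-validation over 800 boosting rounds to find where
# the multinomial log loss stops improving
cv <- xgb.cv(params = params, data = dtrain,
             nrounds = 800, nfold = 3, verbose = FALSE)

best_round <- which.min(cv$evaluation_log$test_mlogloss_mean)

# Refit on the full training data using the selected number of rounds
fit_xgb <- xgb.train(params = params, data = dtrain, nrounds = best_round)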

Based on this information, we tried another model in which we sequentially added the most important variables from the full set of features.

Our accuracy was 72.66%, lower than the one achieved using the original dataset.

Gradient Boosting

  • After some initial tuning of H2o’s Gradient Boosting (GBM) algorithm, an accuracy of 76.64% was reached.
  • The grid-search tuning looked as follows:

hyper_params_gbm <- list(
  ntrees     = c(40, 100, 200),
  max_depth  = 20,
  min_rows   = c(5, 10, 15),
  learn_rate = c(.01, .1, .3))

max_depth is the maximum depth of a tree (the length of the longest path from the root node to a leaf); min_rows is the minimum number of observations assigned to each terminal node.
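
Fed into the grid search, this looks roughly as follows (same assumed frame and column names as in the deep learning section; the number of folds is illustrative):

gbm_grid <- h2o.grid(
  algorithm      = "gbm",
  grid_id        = "gbm_grid",
  x              = predictors,
  y              = "Cover_Type",
  training_frame = train_h2o,
  nfolds         = 10,
  hyper_params   = hyper_params_gbm)

h2o.getGrid("gbm_grid", sort_by = "logloss", decreasing = FALSE)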

  • Hyper-parameter tuning gives an improvement to 78.05%.
  • Using a second set of feature-engineered variables leads to a further improvement to 79.01%.

Extremely Randomized Trees

As mentioned earlier, the ExtraTrees algorithm adds another level of randomness to the mix by choosing the cut point for each split at random, while using the whole of the in-sample data to build each tree. The implementation also allows us to choose the number of random cut points.

A simple implementation of ExtraTrees in R, with 13 features tried at each node and 2 random cuts, led to an accuracy of 79.22%, our second-best result.

The procedure for tuning the model was to first estimate an optimal number of trees, then estimate the number of features considered at each split, and finally estimate the number of random cuts and the node size simultaneously to arrive at the final model.

The model selected with the parameters mtry = 10, ntree = 200, node size = 2, and 5 random cuts improved on our earlier model with an accuracy of 79.25%.
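
That final model corresponds roughly to the following call to the extraTrees package (assuming a numeric feature matrix x_train, a factor response y_train, and a test matrix x_test; the package runs on rJava, so the JVM memory option must be set before it is loaded):

# Give the Java virtual machine enough memory before rJava/extraTrees are loaded
options(java.parameters = "-Xmx4g")
library(extraTrees)

fit_et <- extraTrees(x_train, y_train,
                     ntree         = 200,  # number of trees
                     mtry          = 10,   # features considered at each split
                     nodesize      = 2,    # minimum size of terminal nodes
                     numRandomCuts = 5,    # random cut points tried per selected feature
                     numThreads    = 3)

pred_et <- predict(fit_et, x_test)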

Ensembling

Performance, Benchmarks and Accuracy

Learning and Improvements

Conclusion

The forest cover dataset is an extremely rich dataset for anyone who wishes to implement multi-class classification algorithms across a wide range of methods. It will continue to be a benchmark for classification algorithms for some time.

Appendix

Models in bold denote highest test set accuracy for a specific model type

Classifier | Feature Engineering | Parameters | Accuracy | Rank
---------- | ------------------- | ---------- | -------- | ----
Logistic Regression - Ridge | No | 70-30 split | 0.55696 | 1,489
Logistic Regression - Ridge | No | 85-15 split | 0.59550 | 1,414
Logistic Regression - Lasso | No | 70-30 split, best lambda | 0.59594 | 1,411
Logistic Regression - Lasso | No | 85-15 split, best lambda | 0.59526 | 1,415
Logistic Regression - Lasso | No | 85-15 split, lowest 50th percentile of lambdas, mode of predictions | 0.59443 | 1,418
Logistic Regression - Elastic-net | No | 70-30 split | 0.59588 | 1,412
Logistic Regression - Elastic-net | No | 85-15 split | 0.59520 | 1,415
Logistic Regression - Elastic-net | No | whole training set | 0.59496 | 1,417
Random Forest | No | 80-20 split | 0.68932 | 1,286
Random Forest | No | whole training set | 0.70748 | 1,210
Random Forest | No | ntree=500, mtry=13 | 0.75168 | 792
Random Forest | No | ntree=500, mtry=17 | 0.754 | 672
Random Forest | Yes | ntree=500, mtry=13 | 0.69115 | 1,282
Random Forest | Yes | ntree=500, mtry=20 | 0.71929 | 1,221
Random Forest | Yes | ntree=500, mtry=30 | 0.71831 | 1,133
Neural Network | No | nnet package, whole training set, maxit=300 | 0.50867 | 1,554
Neural Network | No | nnet package, whole training set, maxit=500 | 0.51487 | 1,549
Neural Network | No | nnet package, 80% training set, maxit=300 | 0.52113 | 1,545
Neural Network | No | nnet package, 80% training set, maxit=500 | 0.52432 | 1,541
Neural Network - Deep Learning | No | H2o package | 0.5946 | 1,418
Neural Network - Deep Learning | No | H2o package, 54-100-120-7 | 0.60186 | 1,403
Neural Network - Deep Learning | No | H2o package, 54-120-150-7 | 0.60186 | 1,403
Neural Network - Deep Learning | No | H2o package, 54-500-800-7, rate=0.02, rate_annealing=1e-05, rate_decay=1 | 0.67428 | 1,325
XGBoost | No | nround=50 | 0.70033 | 1,258
XGBoost | No | nround=104 | 0.73438 | 991
XGBoost | Yes | after 800 rounds | 0.72949 | 1,040
XGBoost | Yes | CV run sequentially choosing the 62 variables by importance, with the optimal set of parameters | 0.72656 | 1,070
GBM - H2o | No | ntrees=250, max_depth=18, min_rows=10, learn_rate=.1 | 0.76640 | 467
GBM - H2o | No | ntrees=200, max_depth=20, min_rows=10, learn_rate=.1 | 0.76119 | 521
GBM - H2o | No | ntrees=200, max_depth=20, min_rows=10, learn_rate=.1, CV 80/20 | 0.74509 | 911
GBM - H2o | No | w/o stopping_rounds=2, stopping_tolerance=0.01, score_each_iteration=T, 10-fold | 0.76586 | 479
GBM - H2o | No | w/o stopping_rounds=2, stopping_tolerance=0.01, score_each_iteration=T, 15-fold | 0.78052 | 356
GBM - H2o | No | after reducing row sampling rate to 0.4 and column sampling rate to 0.3 (roughly sqrt(n)/n), 9-fold | 0.72656 | 1,070
ExtraTrees | No | mtry=13, numRandomCuts=2, nodesize=3, numThreads=3, ntree=50 | 0.79220 | 224
ExtraTrees | No | mtry=10, numRandomCuts=5, nodesize=2, numThreads=3, ntree=200 | 0.79247 | 224
ExtraTrees | Yes | mtry=13, numRandomCuts=4, nodesize=3, numThreads=3, ntree=700 | 0.78739 | 308
ExtraTrees | Yes | tuned: mtry=56, numRandomCuts=4, nodesize=3, numThreads=3, ntree=200 | 0.757000 | 613
Ensemble I | No | basic voting of RF, GBM, ExtraTrees, and XGBoost | 0.76787 | 448
Ensemble II | No | feeding ExtraTrees to H2o XGB | 0.79182 | 230
Ensemble III | No | feeding ExtraTrees to H2o GBM, then to XGB | 0.79182 | 230

GitHub Repository

https://github.com/nycdatasci/bootcamp004_project/tree/master/Project4-Machinelearning/Aravind_thomas

References: