

How to Analyze High-Dimensional Data


This article describes only the most important steps of analyzing high-dimensional/big data using machine learning techniques; it does not describe each concept in detail. There will be separate blog posts going into the details of each topic. I thought it was important to have a summary of the main steps of how we approach a business problem using machine learning techniques and which factors we should focus on.

Note that any high-dimensional (HD) data analysis technique can be used to analyze big data as well as small data sets, which in general are analyzed using classical statistical techniques. Big data can be viewed as a subset of high-dimensional data. High-dimensional simply means that the number of variables \((p)\) is close to or greater than the number of observations \((n)\), i.e., \(p \approx n\) or \(p \gg n\). Now, let's go through the main steps of the HD data analysis process.

(1) Define the Business Problem

At the very beginning, the first thing you should do is define the business problem in terms of Key Performance Indicators (KPIs), which measure progress toward the project goals. The KPIs will depend on the department and be specific to the business goals and targets. It is really beneficial to identify the correct KPIs for your department or organization. Learn more about KPIs at https://einsights.com/key-performance-indicators-kpi/.

Once you define the right KPIs, it is time to think about the data.

While small data sets are often stored in a series of spreadsheets, large data sets will be stored in databases, Hadoop clusters, data warehouses, or cloud sources. Once you have access to the data, you can extract it using the relevant query language or interface, such as SQL, Oracle's OCI, etc.
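As a rough sketch, pulling data into Python with a SQL query might look like the following. The SQLite file sales.db, the orders table, and the column names are hypothetical placeholders for your own source.

```python
# Minimal sketch: extract data with SQL into a pandas DataFrame.
# "sales.db", "orders", and the column names are hypothetical.
import sqlite3
import pandas as pd

conn = sqlite3.connect("sales.db")            # connect to the database file
query = """
    SELECT customer_id, order_date, amount
    FROM orders
    WHERE order_date >= '2016-01-01'
"""
df = pd.read_sql_query(query, conn)           # run the query, return a DataFrame
conn.close()
print(df.shape)                               # rows and columns extracted
```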

Suppose that you have collected the required data related to the business question. What next?

(2) Explore the Data Set


As a first step, you should explore the data set before even deciding on any analysis technique. Study the variables, the dimensions of the data set (i.e., the number of observations and the number of features (predictors)), the class of each feature (i.e., continuous or discrete), the status of the outcome variable (if we have an outcome label), etc.


The next thing would be to obtain descriptive statistics of the data set: examine the mean, mode, standard deviation, outliers, and percentage of missing data for each variable, e.g., the mean for continuous features and the median for discrete variables. (Of course, if your data set has thousands of features, it is not practical to look at the descriptive statistics of every variable.)
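As a sketch, a first exploration pass with pandas might look like this; the file name data.csv is a placeholder for your own source.

```python
# Quick exploration of dimensions, feature classes, descriptive statistics,
# and missingness. "data.csv" is a hypothetical file name.
import pandas as pd

df = pd.read_csv("data.csv")

print(df.shape)                       # number of observations (n) and features (p)
print(df.dtypes)                      # class of each feature (numeric vs. object/categorical)
print(df.describe())                  # mean, std, quartiles for numeric features
print(df.describe(include="object"))  # counts and modes for categorical features
print(df.isna().mean().sort_values(ascending=False).head(10))  # % missing per variable
```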



(3) Data Cleaning and Pre-processing


At this stage, you will prepare the data set for the analysis. Very often, we have to deal with missing data, multicollinearity, skewed data, outliers, etc. Also, if you are very sure about which variables are redundant, it is better to exclude them, especially when you deal with high-dimensional data.

Missing Values

  • What should we do if the missing data are important to the analysis, that is, if the missingness itself carries information related to the outcome or appears to be associated with other features in the data set? In this case, we cannot simply exclude the missing values from the analysis.

  • How do we handle missing data? First, we should study the missing-data mechanism: is it Missing At Random (MAR), Missing Not At Random (MNAR), or Missing Completely At Random (MCAR)? Go to https://www-users.york.ac.uk/~mb55/intro/typemiss4.htm to learn more about missing-data mechanisms in detail. Briefly, when the probability of missingness depends only on the observed data and not on the missing values themselves, the data are MAR; when missingness depends on the unobserved (missing) values themselves, the data are MNAR; and when missingness is unrelated to the data altogether, the data are MCAR.



(3.1) Handling Missing Values (NA)

If possible, recovering the NAs (e.g., from the original source) would be the best option. If not, there are different ways to handle missing values depending on the missing-data mechanism.


1. Case-wise (list-wise) deletion

As the name suggests, this approach deletes the entire row if at least one value is missing. This is not a good choice for real-world data sets, because many rows are likely to contain at least one missing value across the features. If we decide to use case-wise deletion, we may run into problems with sample size as well as loss of information.


2. All possible value imputation

This method simply replaces the NAs of a given variable with the possible values of that variable. For example, suppose data are missing for a gender variable with two categories, male and female. In this case, we can replace the missing observations with male or female based on the distribution of the gender variable.


3. Mean/Mode imputation

Mean/mode imputation can be used when the data are missing at random. This method imputes NAs using the mean (or median) for numerical variables and the mode for categorical variables, based on the available observations.
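A minimal pandas sketch of this idea, using a toy data frame with made-up columns:

```python
# Mean imputation for numeric columns, mode imputation for categorical columns.
# The toy data frame is purely illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "gender": ["male", "female", np.nan, "female"]})

for col in df.select_dtypes(include="number"):    # numeric features: fill with the mean
    df[col] = df[col].fillna(df[col].mean())

for col in df.select_dtypes(exclude="number"):    # categorical features: fill with the mode
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```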


4. Multiple imputation

This is another commonly used imputation method for handling NAs when the data are MAR. Multiple imputation imputes the missing data using a model that incorporates random variation, repeats this step many times to create several completed data sets, performs the analysis on each imputed data set, and then averages (pools) the results across all of them to obtain the final output.
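One way to sketch this idea in Python is with scikit-learn's IterativeImputer, drawing several stochastic imputations and averaging the fitted coefficients. This is a simplification of full multiple imputation (proper pooling would also combine variances, e.g., via Rubin's rules), and the toy data are purely illustrative.

```python
# Sketch of the multiple-imputation idea: several stochastic imputations,
# one analysis model per completed data set, averaged estimates.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [2.0, np.nan], [np.nan, 6.0], [4.0, 8.0]])  # toy data with NAs
y = np.array([1.0, 2.0, 3.0, 4.0])

coefs = []
for seed in range(5):                                   # m = 5 imputed data sets
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    X_imp = imputer.fit_transform(X)                    # stochastic imputation
    coefs.append(LinearRegression().fit(X_imp, y).coef_)

print(np.mean(coefs, axis=0))                           # averaged coefficient estimates
```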


5. Expectation Maximization (EM) algorithm

This is one of the most popular methods for dealing with missing data when the missing mechanism is MAR. The process consists of two steps, the E-step and the M-step. The missing data are replaced with iteratively imputed values based on the means and covariances of the variables.


6. Hot-deck and cold-deck imputation

The data set is partitioned into clusters, and the mean of each cluster is computed. The missing values are then replaced with the mean of the cluster to which the observation belongs.


7. Regression substitution method

Here, imputation is done by assuming that the variable with missing data is linearly related to the other variables in the data set. In practice, this works like a train/test split for the incomplete variable: the rows with observed values act as the training set and the rows with missing values act as the test set. We fit a regression model on the training rows and use it to predict values for the missing entries.


8. K-nearest neighbour (knn) imputation

This method handles missing data by replacing each NA with a value derived from the observation's K nearest neighbours.
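A short sketch using scikit-learn's KNNImputer, on a purely illustrative toy matrix:

```python
# KNN imputation: each NA is filled using the observation's nearest neighbours.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

imputer = KNNImputer(n_neighbors=2)       # each NA is filled from its 2 nearest rows
print(imputer.fit_transform(X))
```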


9. If the feature is discrete, create a separate category for the missing values and include that category when building the model.



What if the outcome variable is missing?

There are two ways to handle a missing response variable:

  1. Missing category approach

This method involves creating a category for the missing values based on the distribution of the missing data and the available observations.

  2. Create a surrogate variable



(3.2) Handling Outliers and Transformation

Outlier observations cause significant problems for model outcomes, leading to incorrect, biased, or irrelevant predictions. Outliers can occur for several reasons, such as data-entry errors, measurement errors, sampling errors, and data-processing errors. There are different ways to detect outliers in low-dimensional data sets, such as the standard deviation rule or Tukey's method, and multivariate methods (scatter plots of y vs. x, the Mahalanobis distance, and the Minkowski error). The most common methods used to detect outliers in HD data are clustering and the isolation forest (based on an anomaly score). Learn more about outlier detection methods at https://machinelearningmastery.com/how-to-identify-outliers-in-your-data/.
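As an illustration, an isolation-forest detection step might look like the sketch below; the synthetic data and the contamination rate are assumptions made for the example.

```python
# Isolation-forest outlier detection on synthetic data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, size=(200, 5)),     # bulk of the data
               rng.normal(8, 1, size=(5, 5))])      # a few extreme points

iso = IsolationForest(contamination=0.03, random_state=42)
labels = iso.fit_predict(X)                          # -1 = outlier, 1 = inlier
scores = iso.score_samples(X)                        # lower score = more anomalous
print("flagged outliers:", np.where(labels == -1)[0])
```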

**The first options for handling outliers are to remove them, to assign new values to them, or to cap them based on the underlying distribution.** After handling the outliers, the next important step is to check the scale of the data. Very often our variables do not come on the same scale, so the data require transformation prior to any kind of analysis.


Scaling and Centering Data (Normalization of Data)

Pre-processing of the data set includes dealing with outlier observations and performing any necessary modifications to the data. Handling outliers is very important before moving on to the analysis stage. Another key step is centering and/or scaling the observations.


Centering a predictor means subtracting the mean of the predictor from each individual value \((x_i - \bar{x})\). After centering, the predictor has a mean of zero.


Scaling is simply dividing by a constant; for a predictor, scaling is done by dividing each value by the predictor's standard deviation.


Performing both centering and scaling is often called standardizing (or normalizing) the data. This is one method of transforming data.
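For example, here is a quick sketch of centering and scaling, done by hand and then with scikit-learn's StandardScaler; the toy matrix is illustrative.

```python
# Centering and scaling (standardization), manually and via StandardScaler.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0], [175.0, 70.0]])

X_manual = (X - X.mean(axis=0)) / X.std(axis=0)   # (x_i - x_bar) / s for each column
X_scaled = StandardScaler().fit_transform(X)      # same operation via scikit-learn

print(np.allclose(X_manual, X_scaled))            # True: both give mean 0, sd 1 per column
```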


Transformation methods

Some of the most popular data transformation methods are the log, square-root, and inverse transformations. For example, if you use a log transformation to remove skewness, you replace each predictor value with its logarithm.


The Box-Cox transformation is a very powerful technique for handling skewed data; it can also transform a non-normal continuous response variable into approximately normally distributed data. Here, I only focus on the Box-Cox transformation for handling skewed data. Unfortunately, most of the time we have to deal with messy data. Very often the data will be right skewed (most data points concentrated to the left, with a long right tail) or left skewed (most data points concentrated to the right), so it is very important to know how to deal with skewed data as well.


The magnitude of skewness is measured using a statistic based on the predictor values, their mean, and the number of observations. As a rule of thumb, if the ratio between the highest and lowest values of a predictor is greater than 20, the predictor is considered significantly skewed.

Box-Cox transformation:

\[ x^* = \begin{cases} \dfrac{x^{\lambda}-1}{\lambda} & \text{if } \lambda \neq 0 \\[6pt] \log(x) & \text{if } \lambda = 0 \end{cases} \]

When,

\(\lambda = 2 \Rightarrow\) square transformation,

\(\lambda = 0.5 \Rightarrow\) square-root transformation, and

\(\lambda = -1 \Rightarrow\) inverse transformation.
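As a sketch, the Box-Cox transformation can be applied with scipy, which also estimates \(\lambda\) by maximum likelihood; the right-skewed toy data are simulated for illustration, and note that Box-Cox requires strictly positive values.

```python
# Box-Cox transformation of simulated right-skewed (log-normal) data.
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)     # right-skewed positive data

x_transformed, lam = stats.boxcox(x)                 # transformed values and estimated lambda
print("estimated lambda:", round(lam, 3))
print("skewness before:", round(stats.skew(x), 2),
      "after:", round(stats.skew(x_transformed), 2))
```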


One more important step in data pre-processing is re-categorizing variables based on their distributions, or creating new variables as needed. For example, suppose the data contain height and weight, but depending on your research question it might be more appropriate to use Body Mass Index (BMI) in the analysis instead of weight and height separately. In that case, BMI should be computed and added as a new column in the data table.
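A tiny sketch of deriving such a variable with pandas; the column names and values are made up, and BMI is computed as weight (kg) divided by height (m) squared.

```python
# Derive a new feature (BMI) from existing columns.
import pandas as pd

df = pd.DataFrame({"height_cm": [160, 175, 182], "weight_kg": [55, 72, 95]})
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2   # kg / m^2
print(df)
```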


Likewise, there might be situations where you have to discretize continuous variables. It is very important to finish pre-processing the data set before moving on to the analysis.


Once the pre-processing and data cleaning are completed, you are ready to start analyzing the data. For high-dimensional data, based on the outcome variable and your business problem, you have to decide which machine learning algorithm to use.

(4) Model Building

(4.1) Deciding the technique or algorithm

There are two main types of machine learning techniques for big data analysis:

1. Supervised learning: used to predict the outcome variable or to find associations between the outcome variable and the predictors.

2. Unsupervised learning: no outcome label is available, but you want to detect associations or structure among the variables in the data set.


Classification and regression tree (CART) techniques (a single decision tree, bagging, random forests, and boosting) can be used for both regression and classification.


Reinforcement learning is another important paradigm in ML, which makes it possible to handle many challenging tasks such as playing games, self-driving cars, and teaching a robot to perform human tasks.

However, I will only list supervised learning techniques, since this article mainly focuses on the steps of building a predictive model.

Supervised learning has two main areas: if the response variable is continuous, we will use regression techniques and if the response variable is categorical, we will use classification techniques.



Classification methods:

  • Support vector machines (SVM)

  • K-nearest neighbors

  • Logistic regression in High Dimensions

    1. Variable pre-selection

    2. Forward stepwise logistic regression

    3. Lasso logistic regression

    4. Ridge logistic regression

    5. Principal components logistic regression

  • Discriminant methods (Classifier that relies on Bayes rule, Naive Bayes, Markov blanket)

  • Neural network

  • Decision trees

  • Bagging

  • Random forest

  • Boosting

The last four techniques can also be used for continuous response data.



Regression methods:

  • Variable Pre-Selection

  • Forward stepwise regression

  • Lasso regression

  • Ridge regression

  • Principal components regression

  • SVM for regression

As stated above, it is now time to decide on a machine learning algorithm for your data set. Suppose your outcome variable is categorical. In a high-dimensional setting, the majority would start with decision trees or SVMs to deal with categorical outcomes.

Again, it is very important to know the data set well. If the data set has millions of observations and not many variables, it is worth trying a random forest model; but if there are lots of variables along with a large number of observations, a random forest would not be a good first choice. Instead, first try to reduce the dimensionality, for example using clustering, or LASSO regression if the outcome is linearly related to the predictors, and then try boosting or another relevant method to build the model. It is important to remember that you will very often have to try or combine different models/algorithms to obtain a model with good predictive performance.
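As one possible sketch of the LASSO-based reduction step mentioned above, using synthetic data and an arbitrary penalty strength:

```python
# Use the LASSO to shrink the feature set before fitting a second model.
# The synthetic data and alpha value are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=500, n_informative=10,
                       noise=5.0, random_state=0)    # a p >> n style problem

lasso = Lasso(alpha=1.0, max_iter=10000).fit(X, y)
selected = np.where(lasso.coef_ != 0)[0]             # features with non-zero coefficients
print("kept", len(selected), "of", X.shape[1], "features")

X_reduced = X[:, selected]                           # pass this to the next model (e.g., boosting)
```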

Suppose you have decided which algorithm to use. What would be the next important step?

(4.2) Split Data, Test Error, Training Error

Remember that we are going to develop a predictive model using the available observations, which means we will be using our final model to make predictions on unseen data. So it is very important to check how well our model performs on a new data set.

If we fit the model to the whole data set, it will perform well on the data used to fit it, but it may perform very badly on a new data set. Thus, we should not use the whole data set to fit the model; instead, we split the data into three subsets: a training set, a validation set, and a test set.

First, we use only the training set (e.g., 60% of the data) to fit the model, then we use the validation set (20% of the data) to evaluate and tune the model, and finally we use the test set (20% of the data) to assess the performance of the chosen model. If you do not have a large enough sample to split into three sets, it is better to split the data into two sets, a training set and a validation set. With two sets, the best model is the one that gives the lowest validation (test) error. Finally, you can refit the chosen model on the entire data set.
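A sketch of a 60/20/20 split with scikit-learn, using synthetic data for illustration:

```python
# 60/20/20 train/validation/test split via two calls to train_test_split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# First carve off 20% as the test set, then split the rest 75/25 so that
# the overall proportions are 60% training and 20% validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0, stratify=y_trainval)

print(len(X_train), len(X_val), len(X_test))          # 600, 200, 200
```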

Hmm... why test error?

Well, our goal is to develop a predictive model, but we should take another important factor into account: the model should not be too complex or too simple. What does that mean? There are lots of variables (features) in high-dimensional data. How many of these features are really important for predicting the outcome?


There might be many noisy, redundant variables in the data set, so we have to keep only the variables that are important in the model. Removing all the noisy variables from the model is called dimension reduction. Dimension reduction leads to a model that is neither too simple nor too complex.


What is the complexity of a model? The complexity of a model depends on several factors, such as the number of variables in the model, interaction terms, polynomial terms, etc. What happens when we end up with a model that has many variables? The model becomes too complex. As model complexity increases, the bias decreases and the variance increases. We want a model that gives the best trade-off between bias and variance.

\[ \text{Expected Test Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} \]

Now, let's look at the relationship between model complexity, the training error, and the \(R^2\) value. What happens when we increase the number of variables in the model? The training error decreases and \(R^2\) increases, but the test error decreases only up to a certain point, after which it starts increasing. Hence, choosing a model based on the test error is similar to choosing the model that gives the best trade-off between bias and variance.

\[\begin{equation*} \text{Mean Squared Test Error} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i^{\text{test}}-\hat{y}_i^{\text{test}}\right)^2 \end{equation*}\]

where \(N\) is the number of test observations and \(i = 1,\dots,N\).

[Figure: Training error vs. test error as a function of model complexity. Figure from Hastie et al. (2001).]

Now we know that we will select the model based on the validation-set error (the test error, if you split your data into two sets). How do we estimate this error?


There are three ways to estimate the validation-set error (a K-fold cross-validation sketch follows the list):

  1. Validation set approach

  2. Leave-One-Out cross validation

  3. K-fold cross validation
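As an example of the third option, here is a minimal K-fold cross-validation sketch with scikit-learn; the logistic regression model and synthetic data are assumptions for illustration.

```python
# 5-fold cross-validation estimate of out-of-sample accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)           # one accuracy score per fold
print(scores.mean(), scores.std())
```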


Suppose that we have obtained the best model as described above. Now let's focus on how to check the model's performance.



(5) Model Performance vs. Model Accuracy


How do we decide whether the model we obtained performs well on predictions? Before checking how well the model predicts, we should look at how robust the model is using AIC/BIC, AUC-ROC, AUC-PR, the Kolmogorov-Smirnov statistic, etc. Unlike in linear regression, evaluating the accuracy of a classification model is more complex.


In classification problems, we can examine how well our model performs using the following key metrics: classification accuracy, recall (sensitivity), precision, and the Matthews correlation coefficient (MCC), which is defined as

\[ \text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \]


In regression problems, we can use measures such as the mean squared test error described earlier.



Confusion Matrix

| True \ Predicted | Yes                  | No                   |
|------------------|----------------------|----------------------|
| Yes              | True Positives (TP)  | False Negatives (FN) |
| No               | False Positives (FP) | True Negatives (TN)  |


Now let's use the confusion matrix to evaluate the performance of a classification model, considering the key metrics stated above. All of these terms are important when testing model performance. Can we look at accuracy alone to see how good the model is? No. Let's dig into the details.

\[ \text{Classification Accuracy}= \frac{TP+TN}{TP+FP+TN+FN} \]

Let’s see why considering only classification accuracy is not a good choice to check the model performance. Consider the following confusion matrix.

Confusion Matrix

| True \ Predicted | Yes | No  |
|------------------|-----|-----|
| Yes              | 85  | 6   |
| No               | 4   | 105 |

\[ \text{Classification Accuracy}= \frac{85+105}{85+6+105+4}\times 100 = 95\% \]

We have a classification accuracy of 95%, but is there another way to get a classification accuracy of 95%? Let's look at the following case.

Confusion Matrix

| True \ Predicted | Yes | No  |
|------------------|-----|-----|
| Yes              | 5   | 6   |
| No               | 4   | 185 |

\[\begin{equation*} \text{Classification Accuracy}= \frac{5+185}{5+6+185+4}\times 100 = 95\% \end{equation*}\]

Now we can see that classification accuracy alone does not tell us how well the model performs: the model classified only 5 of the 11 actual "Yes" observations correctly, misclassifying the other 6 into the "No" group, yet the accuracy is still 95%. This typically happens with imbalanced data. I will list methods for dealing with imbalanced data at the very end of this article.

How should we really check the model performance then?

Considering "recall" and "precision" would be a better way to check model performance. How exactly do these two metrics capture performance?

Recall (or sensitivity) is the proportion of correctly predicted positive observations (TP) out of all observations that are actually positive (TP + FN). So we need our model to reduce false negatives and increase true positives.

What about precision? Precision is the proportion of correctly predicted positive observations (TP) out of the total number of positive predictions (TP + FP). To have better precision, the model should give a lower number of false positives.
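In terms of the confusion-matrix counts defined above, these two metrics can be written as

\[ \text{Recall (Sensitivity)} = \frac{TP}{TP+FN}, \qquad \text{Precision} = \frac{TP}{TP+FP} \]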

Reducing false negatives and false positives improves recall and precision, respectively, but there are situations where one type of error matters much more than the other.

When are false negatives or false positives more important?

Ex. 1:

Suppose we developed a model to predict whether a train will stop automatically when the driver does not slow down and prepare to stop within a certain distance of an object on the track. Consider a situation where the train really would stop, but our model predicts that it will not stop automatically (a false negative). It is better to predict that the train will not stop when it actually does (FN) than to predict that it will stop when it actually does not (FP), because the latter error could lead to an accident.

Ex. 2:

Suppose we fit a model to predict cancer occurrence, where the outcome is either "Yes, there is a chance of cancer" or "No, no risk of cancer." Suppose an individual truly has cancer, but the model predicts "no cancer" (FN). That is really bad, because the physician will send the patient home without referring them for cancer screening, and that person may die or be diagnosed at a much later stage. In this situation, it is better to predict that a person is likely to have cancer even though they do not (FP) than to predict no cancer when the person really has it (FN).

So, when checking model performance, we should consider both precision and recall and look for the best trade-off between the two.
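As a sketch, the key metrics discussed in this section can be computed with scikit-learn; the labels below are constructed to reproduce the second confusion matrix above (5, 6, 4, 185).

```python
# Accuracy, recall, precision, and MCC for the imbalanced example above.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             matthews_corrcoef, precision_score, recall_score)

y_true = np.array([1] * 11 + [0] * 189)                 # 11 actual "Yes", 189 actual "No"
y_pred = np.array([1] * 5 + [0] * 6 + [1] * 4 + [0] * 185)

print(confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))     # 0.95, yet...
print("recall   :", recall_score(y_true, y_pred))       # only 5/11, about 0.45
print("precision:", precision_score(y_true, y_pred))    # 5/9, about 0.56
print("MCC      :", matthews_corrcoef(y_true, y_pred))
```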

(6) Deploy the Model


Suppose you have obtained a good model for your business problem. You are not done yet! Once you have developed the predictive model, you should deploy it according to the requirements of your organization. That could mean preparing a dashboard or implementing a reproducible process so that your model runs well on future data.


References

  1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). New York: Springer.

  2. Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1, pp. 337-387). New York: Springer Series in Statistics.

  3. Kuhn, M., & Johnson, K. (2013). Applied predictive modeling (Vol. 26). New York: Springer.

  4. Bland, M. (2015, August). Types of missing data. https://www-users.york.ac.uk/~mb55/intro/typemiss4.htm



********************************************-********************************************

By: Nirosha Rathnayake, Ph.D. Biostatistics (ABD), UNMC, Omaha, NE

********************************************-********************************************