The data set being used is the leukemia data. It comprises 22283 variables (genes) and 16 samples (patients).
Principal Component Analysis (PCA) is a common unsupervised learning technique used to reduce the dimensionality of data. Some of the benefits of dimension reduction on a data set are:
Applying PCA to the data set transforms the variables into uncorrelated principal components, with PC1 explaining the most variation in the data, PC2 accounting for as much of the remaining variation as possible, and so on. We will be using the packages factoextra and FactoMineR to perform our analysis. These packages were chosen because they can handle data where the number of variables exceeds the number of observations. The initial transformation that we perform on the data frame before any analysis is to transpose the data; the t() function is used for this task.
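The sketch below illustrates this initial step. It assumes the raw expression matrix is held in a data frame called `leukemia_data`, with genes in rows and patients in columns; the object names are assumptions, not taken from the report.

```r
# A minimal sketch, assuming the expression matrix is in `leukemia_data`
# (genes in rows, patients in columns).
library(FactoMineR)
library(factoextra)

# Transpose so that patients become rows (observations) and genes become columns (variables)
leukemia_t <- as.data.frame(t(leukemia_data))

# PCA on the untransformed data; graph = FALSE suppresses the default plots
res.pca <- PCA(leukemia_t, scale.unit = FALSE, graph = FALSE)

# Eigenvalues and percentage of variance explained per component
head(res.pca$eig)
```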
In this section we use PCA on the original data set without any further transformations. The results of the analysis are shown in the table below. In this approach, the first principal component explains 36.63% of the variation in the data, and the first and second principal components together explain 54.68% of the variation.
| | eigenvalue | percentage of variance |
|---|---|---|
| comp 1 | 22884853700 | 36.63041 |
| comp 2 | 11273873284 | 18.04541 |
We now set the scale.unit argument to TRUE in our PCA() function, which standardizes the data. With this approach, the first principal component explains 32.60% of the variation in the data, and the first and second principal components together explain 44.42%.
| | eigenvalue | percentage of variance |
|---|---|---|
| comp 1 | 7263.530 | 32.59673 |
| comp 2 | 2635.949 | 11.82942 |
Now, we set the scale.unit argument of our PCA() function to FALSE and apply the log() function to our data. The proportion of variance explained by the first component, 28.38%, is lower than with the two approaches above.
| | eigenvalue | percentage of variance |
|---|---|---|
| comp 1 | 2284.9398 | 28.384540 |
| comp 2 | 725.6815 | 9.014739 |
Now, we set the scale.unit argument of our PCA() function to TRUE and apply the log() function to our data. The proportion of variance explained by the first component is 29.90%.
| | eigenvalue | percentage of variance |
|---|---|---|
| comp 1 | 6663.806 | 29.90533 |
| comp 2 | 2580.109 | 11.57882 |
In the above exercise, we performed PCA on the data using four different data transformation approaches (see the sketch after this list):

1. Original data, no transformation
2. Standardized data
3. Log-transformed data
4. Log-transformed and standardized data
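As a hedged sketch, the four variants can be obtained as follows, reusing the assumed `leukemia_t` object from above (whether the report adds an offset before taking logs is not stated):

```r
# 1. Original data, no transformation
pca_raw        <- PCA(leukemia_t, scale.unit = FALSE, graph = FALSE)
# 2. Standardized data
pca_scaled     <- PCA(leukemia_t, scale.unit = TRUE, graph = FALSE)
# 3. Log-transformed data (an offset may be needed if zero expression values are present)
pca_log        <- PCA(log(leukemia_t), scale.unit = FALSE, graph = FALSE)
# 4. Log-transformed and standardized data
pca_log_scaled <- PCA(log(leukemia_t), scale.unit = TRUE, graph = FALSE)
```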
In this section we compare the scree plots of each method. The scree plot visualizes the importance of the PCs.
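A scree plot for each fit can be drawn with factoextra, for example (object names as assumed above):

```r
# Scree plots showing the percentage of variance explained per component
fviz_eig(pca_raw, addlabels = TRUE, main = "Original data")
fviz_eig(pca_scaled, addlabels = TRUE, main = "Standardized data")
fviz_eig(pca_log, addlabels = TRUE, main = "Log-transformed data")
fviz_eig(pca_log_scaled, addlabels = TRUE, main = "Log-transformed and standardized data")
```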
For the original data: it is clear that the first principal component explains 36.6% of the variation. From the scree plot we can see that using 5 principal components will be adequate for our analysis; any additional components will not contribute a significant improvement in model performance.
For the standardized data: from the scree plot we can see that using 4 principal components will be adequate for our analysis; any additional components will not contribute a significant improvement in model performance.
For the log-transformed data: the scree plot again suggests that 4 principal components are adequate.
For the log-transformed and standardized data: the scree plot likewise suggests that 4 principal components are adequate.
For each method, when we calculate the total sum of the variance explained we get:
Standardized data give a total sum of variance explained equal to the number of variables in the PCA (here the 22283 genes), since each standardized variable contributes a variance of one.
If dimension reduction is our goal, the optimal number of PCs was achieved using approach one (see scree plot). However, one must consider that without normalizing the data, the variable with the highest variance will dominate the first principal component. This becomes a problem if the data were not collected on the same scale. We observed different results between normalizing the data and not normalizing it, which provides evidence that the variability was not the same across all our variables, and therefore a fair comparison was not achieved.
Transforming the data using the log function is useful when wanting to reduce the impact of outliers and to linearize the relationships between variables. Linearizing the relationships between variables is important because PCA looks for the best linear combinations of the variables.
The varied performance of PCA across the different transformation methods provides evidence that noise does exist in the data. Standardizing the data improved the performance of PCA by ensuring a fair comparison between variables. It is difficult to say whether more noise was removed when the log transformation was included; scaling the data alone captured more variation in the data.
We now extend our analysis to the top 100 genes with the highest expression levels across the samples, again making use of the factoextra and FactoMineR packages. Our first task is to determine which genes make up the top 100.
In order to do this we do the following (see the sketch after this list):

- We perform PCA on the scaled data.
- We use res.pca$var$contrib to obtain the % contributions of the variables to the first 5 principal components.
- We reduce this data frame to the first principal component only.
- We sort the % contributions from largest to smallest.
- We take the first 100 genes as our most important gene expression levels.
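A sketch of this selection, assuming the standardized-data fit is the `pca_scaled` object introduced above:

```r
# % contribution of each gene to PC1 (rows of $var$contrib are genes, columns are PCs)
contrib_pc1 <- pca_scaled$var$contrib[, 1]

# Sort from largest to smallest and keep the 100 most important genes
contrib_sorted <- sort(contrib_pc1, decreasing = TRUE)
top100_genes   <- names(contrib_sorted)[1:100]

head(contrib_sorted)                           # the top contributions shown in the table below
leukemia_top100 <- leukemia_t[, top100_genes]  # data reduced to the top 100 genes
```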
The table below shows the contribution % of the top 6 variables on PC1.
| | probe (gene ID) | contribution to PC1 (%) |
|---|---|---|
| 3892 | 204365_s_at | 0.0135325 |
| 5421 | 205894_at | 0.0133844 |
| 6292 | 206766_at | 0.0133496 |
| 4069 | 204542_at | 0.0133419 |
| 5658 | 206132_at | 0.0132217 |
| 14730 | 215356_at | 0.0131796 |
The figure shows the contribution % of the top 100 variables. It is clear from the figure that beyond 100 variables we do not achieve any significant increase in the cumulative %.
The figures below show the results obtained from PCA on the top 100 gene variables, in the form of a scree plot and a PC plot of the individual observations.
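A sketch of this step, reusing the assumed `leukemia_top100` subset; whether the report scales the data at this point is not stated, so the unscaled version is shown:

```r
# PCA on the top 100 genes only
pca_top100 <- PCA(leukemia_top100, scale.unit = FALSE, graph = FALSE)

fviz_eig(pca_top100, addlabels = TRUE)  # scree plot
fviz_pca_ind(pca_top100)                # individuals (patients) on PC1 vs PC2
```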
The first principal component explains the majority of the variation in the data, 94.04%. This is very large compared to the other methods discussed above. We can successfully visualize the first and second principal components in a two-dimensional space. From the individuals PCA plot above we can see evidence of some clusters in the data; we will explore this further in the next section.
The goal of principal component analysis is to find the best low-dimensional representation of the variation in a data set, more specifically a data set with many variables, while removing noise in the data. Reducing the dimensions of our data to only the top 100 variables of interest improved the performance of the analysis, with 94% of the variation explained by the first principal component.
In this section of the assignment we will demonstrate how to use clustering as an exploratory tool.
The first method that we explore is kmeans. We use the elbow method to determine the optimal value of k.
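A minimal elbow-method sketch with factoextra, assuming the clustering is performed on the scaled top-100 data (the exact input used in the report is not stated):

```r
# Total within-cluster sum of squares for k = 1..10 (the "elbow" plot)
fviz_nbclust(scale(leukemia_top100), kmeans, method = "wss", k.max = 10)
```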
From the plot below, the value of k should be set to 3; we, however, have 2 predefined groups. We first use k = 2 and measure how well the method distinguishes between the groups.
In order to assess the performance of the kmeans method we use a confusion matrix to determine how many patients have been classified correctly. The confusion matrix gives us an accuracy of 75%. We illustrate the results using the plots below.
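A hedged sketch of this assessment; `prognosis` stands for the vector of predefined good/poor labels for the 16 patients and is an assumed name:

```r
set.seed(123)
km2 <- kmeans(scale(leukemia_top100), centers = 2, nstart = 25)

# Confusion matrix of cluster assignments against the known prognosis labels
table(cluster = km2$cluster, truth = prognosis)

# Cluster plot on the first two principal components
fviz_cluster(km2, data = scale(leukemia_top100))
```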
In the right plot, we can clearly see an overlap between clusters. In the left plot, we can see that 4 patients belonging to predefined category 2 have been classified as having a good leukemia prognosis. Let us now use k = 3, as the elbow method suggested, to see how the groups are distinguished.
## Too few points to calculate an ellipse
Patient 2 has been classified into a group of its own, which could be an indication of a unique case. This could be evidence that leukemia prognosis has three levels and not just poor or good. We can also see that the method groups patients 3, 4, 5 and 6 as having a good prognosis; these patients could potentially have been misclassified initially and need to be looked into further.
We will now use hierarchical clustering to identify groups. We initially set the distance method to euclidean and compare the following agglomeration methods:
We will use dendrograms to compare.
The next analysis will involve varying the distance methods used. We will set the agglomeration method to single.
We execute the hclust() function to perform the clustering. We first compare the dendrogram of each of the methods, depicted in the figures below. One can clearly see that all three methods produce the same results in terms of the clusters; however, each dendrogram is shaped differently.
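The report does not list which agglomeration methods were compared; the sketch below uses single, complete and average linkage as an illustration, with Euclidean distance on the assumed scaled top-100 data:

```r
d_euc <- dist(scale(leukemia_top100), method = "euclidean")

hc_single   <- hclust(d_euc, method = "single")
hc_complete <- hclust(d_euc, method = "complete")
hc_average  <- hclust(d_euc, method = "average")

# Side-by-side dendrograms for comparison
par(mfrow = c(1, 3))
plot(hc_single, main = "Single linkage")
plot(hc_complete, main = "Complete linkage")
plot(hc_average, main = "Average linkage")
par(mfrow = c(1, 1))
```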
From the dendrograms, we can see that all patients falling into the good leukemia prognosis group have been classified correctly. However, the method only identifies half of the patients in the poor leukemia prognosis group correctly (an accuracy of 50% for that group). This is similar to the results obtained with the kmeans algorithm.
When comparing the misclassification error, all methods have a misclassification error of 25%.
The choice of dissimilarity measures is very important and can have an influence on the resulting dendrogram. In this section we explore how it will influence the results of our data.
We run our model using the following two methods:
The top two figures below show the resulting dendrograms. The two figures do not look the same, and patient 2 still remains the outlier. The bottom figures display the results after scaling the data. It is clear from the figures that using the maximum dissimilarity measure is more sensitive to noise.
We will now compare the different clusters. We can clearly see that, although the dendrograms appear different, the clusters are the same.
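One way to check this claim is to cut each tree into the same number of clusters and cross-tabulate the memberships. The sketch below compares the Euclidean and maximum dissimilarity trees with single linkage; k = 2 and the use of the scaled top-100 data are assumptions.

```r
hc_euc <- hclust(dist(scale(leukemia_top100), method = "euclidean"), method = "single")
hc_max <- hclust(dist(scale(leukemia_top100), method = "maximum"), method = "single")

# Identical clusterings produce a purely diagonal (or anti-diagonal) table
table(euclidean = cutree(hc_euc, k = 2),
      maximum   = cutree(hc_max, k = 2))
```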
In this section, we applied the unsupervised clustering methods kmeans and hierarchical clustering to our data set. We further varied the distance and agglomeration methods used in the hierarchical clustering to see whether we would achieve different results. We noticed that using the maximum distance to measure dissimilarity was the most sensitive to data that had not been scaled. All methods clustered the data identically, so we can be confident that the results we obtained are not due to chance; the principal components are a consistent distinguisher of the groups.
We did, however, notice that our unsupervised learning methods identified 3 clusters of patients as opposed to the original two that had been given. In practice, this can be an indication of an incorrect diagnosis classification, which can lead to further patient examinations being carried out. We also identified a unique case. These techniques therefore have the advantage of revealing information about people and things that we did not previously know.
Self-organizing maps (SOM) have been successful at addressing many different types of problems. This is particularly true in the context of clustering tasks (Kohonen 2001; Bacao, Lobo et al. 2005).
In this section we demonstrate the use of SOM and MDS to segment young people into groups based on preferences, interests, habits, opinions, and fears.
Given a number of interests such as Mathematics, Science, Music, etc., we would like to find out whether people form clusters of similar behaviour.
We explore the preferences, interests, habits, opinions, and fears of young people using the young people survey data from Kaggle. The data that we are going to use consist of 1010 observations and 150 variables. We remove the label variables and use the first 144 variables for our analysis. The data set contains missing values, which we omit, leaving 686 observations.
We construct a 15 x 10 hexagonal SOM grid and set the number of iterations to 200. We check the adequacy of these choices using the changes and counts plots. Our goal is to investigate whether we can segment people into groups; to achieve this, we first identify how many clusters would be optimal.
## Warning: did not converge in 10 iterations
The total within-cluster sum of squares indicates that 6 clusters will be adequate. The changes plot indicates that 200 iterations are sufficient.
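A sketch of the SOM set-up and the cluster-count check with the kohonen package; `survey_data` stands for the cleaned, complete-case numeric survey matrix (686 rows, 144 columns) and is an assumed name.

```r
library(kohonen)

set.seed(123)
som_grid  <- somgrid(xdim = 15, ydim = 10, topo = "hexagonal")
som_model <- som(scale(survey_data), grid = som_grid, rlen = 200)

plot(som_model, type = "changes")  # training progress over the 200 iterations

# Within-cluster sum of squares on the codebook vectors, used to judge how many clusters are adequate
codes <- getCodes(som_model)
wss   <- sapply(2:15, function(k) kmeans(codes, centers = k, nstart = 25)$tot.withinss)
plot(2:15, wss, type = "b", xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
```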
In this section we will use SOM to produce some data visualizations that will help us assess the quality of our model.
The counts plot visualizes the number of cases that have been mapped to each node (cases per node). We can see that this is fairly evenly distributed, with some exceptions. Next to this plot is the U-matrix, which visualizes the distance between each node and its neighbours. In this plot we have two red/yellow nodes that are far from their neighbours, indicating the existence of a distinct cluster.
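These two diagnostics can be produced directly from the fitted model (a sketch, reusing the assumed `som_model` object):

```r
plot(som_model, type = "counts")           # cases mapped to each node
plot(som_model, type = "dist.neighbours")  # the "U-Matrix": distance to neighbouring nodes
```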
We use heat maps to explore more information about the clusters within our data. In particular, we would like to understand how different groups responded to specific questions. The first theme we are interested in understanding is health.
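Heat maps of individual questions can be drawn as property plots over the codebook vectors; the column names "Weight" and "Height" below are taken from the Kaggle survey and should be treated as assumptions.

```r
codes <- getCodes(som_model)  # note: values are on the standardized scale used for training
plot(som_model, type = "property", property = codes[, "Weight"], main = "Weight")
plot(som_model, type = "property", property = codes[, "Height"], main = "Height")
```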
We can immediately see that the left cluster is characterized by people with a much higher weight. The same cluster can also be characterized as people who are taller. There is a correlation here: short people generally weigh less than tall people.
We can also see a correlation between being healthy and spending on healthy food.
We will now explore the interest in Science and Mathematics.
There is more of a correlation between clusters with regard to interest in Math and Physics than with Chemistry.
In this section, we demonstrated the usefulness of SOM for data exploration in a high-dimensional setting. SOM provided an efficient and effective means of obtaining information about the data and the groups within it, permitting the identification of clusters that would otherwise pass unnoticed, particularly in a high-dimensional space.
We were able to explore categories of interest and to see how the model had clustered the groups based on these categories.
We will now explore multidimensional scaling (MDS) and compare the clustering results. We will explore the following 4 methods: classical (metric) scaling, Non Matrix scaling, Sammon mapping, and Kruskal's non-metric MDS.
The figures below show the results for each method.
- Classical scaling: distinguishes our data into three clusters, although this is not easy to identify visually.
- Non Matrix scaling: from the figure above, this method clearly separates the outlier from the sample, i.e. the third cluster.
- Sammon mapping: the data points appear much closer together, making it difficult to see the different groups.
- Kruskal's non-metric MDS: here we get results similar to those of the Non Matrix scaling method.
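A sketch of three of these methods using base R and MASS; the exact function behind the "Non Matrix" method is not stated in the report, so it is omitted here, and the use of the scaled survey data as input is an assumption.

```r
library(MASS)

# Distance matrix from the scaled data; unique() removes duplicate rows,
# which would give zero distances and break sammon()/isoMDS()
d <- dist(unique(scale(survey_data)))

mds_classical <- cmdscale(d, k = 2)       # classical (metric) scaling
mds_sammon    <- sammon(d, k = 2)$points  # Sammon mapping
mds_kruskal   <- isoMDS(d, k = 2)$points  # Kruskal's non-metric MDS

plot(mds_classical, main = "Classical scaling")
plot(mds_sammon, main = "Sammon mapping")
plot(mds_kruskal, main = "Kruskal's non-metric MDS")
```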
Overall, we explored the importance of scaling multidimensional data, both to improve clustering performance and for data visualization. As expected, different scaling methods produced different results, which impacts their reliability. It is therefore important to test your data with different methods to ensure consistency in the output that you receive.