In supervised learning, we typically have access to a set of p features measured on n observations, and a response Y measured on those same observations.
Unsupervised learning is a set of methods used when we do not have an associated response variable Y. These methods are used to answer questions like “Can we discover subgroups among the variables or among the observations?”. Here we focus on two techniques…
In general, unsupervised learning is much more challenging than supervised learning because it is more subjective. There is no simple goal for the analysis, such as predicting a response. Unsupervised learning is often used as part of an exploratory data analysis. Further, it is harder to assess the accuracy of the results obtained, since there is no universally accepted mechanism like cross-validation for validating them. Simply put, we cannot really check our work in an unsupervised setting beyond simple intuition or theoretical knowledge of the process at hand. Nevertheless, there are many uses for unsupervised methods:
Understand cancer behavior by identifying subgroups of patients.
Websites (particularly e-commerce) often try to recommend products to you based on your previous activity.
Netflix movie recommendations.
I’ve already talked about principal component analysis when I did principal components regression here. When presented with a large set of correlated variables, principal components allow us to summarize the set with a smaller number of representative variables that collectively explain most of the variability in the original set.
Principal component analysis (PCA) refers to the process by which principal components are computed, and to the subsequent use of those components in understanding the data. PCA also serves as a tool for data visualization.
Suppose we wish to visualize n observations with measurements on a set of p features as part of an exploratory data analysis. We could use two-dimensional scatterplots of the data; however, there are potentially a lot of scatterplots to create if p is large, \(p(p-1)/2\) to be exact. If p = 10 there are 45 plots. Clearly we need an alternative when p is large. Specifically, we want to find a low-dimensional representation of the data that captures as much information as possible. PCA provides a method to do this: it seeks a small number of dimensions that are as interesting as possible, where interesting is measured by the amount that the observations vary across each dimension. Each of the dimensions found by PCA is a linear combination of the p features, so this is technically not a form of feature selection. Also note that before computing PCA the variables should be centered to have mean zero, and typically scaled to unit variance (unless they are all measured in the same units).
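As a quick sanity check on the scatterplot count and on what centering and scaling do, here is a small base-R sketch (the simulated matrix is purely for illustration):

p <- 10
choose(p, 2)                        # p*(p-1)/2 = 45 pairwise scatterplots for p = 10
x <- matrix(rnorm(50 * p), 50, p)   # 50 simulated observations on p features
colMeans(scale(x))                  # scale() centers each column to mean zero (and scales to unit variance)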
For an explanation of how principal components are found, please refer here.
We can also measure how much information is lost by using principal components. To do this we compute the proportion of variance explained (PVE) by each principal component. It is often best visualized as a cumulative plot, so that we can see both the PVE of each component and the total variance explained. Once we have this measurement, we can start to judge whether the principal components explain enough of the variance to provide an accurate summary.
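For reference, writing the centered observations as \(x_{ij}\) and the loadings of the m-th component as \(\phi_{jm}\), the PVE of the m-th principal component is the variance of its scores divided by the total variance:

\[ PVE_m = \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{jm}x_{ij}\right)^2}{\sum_{j=1}^{p}\sum_{i=1}^{n}x_{ij}^{2}} \]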
In general, we would like to use the smallest number of principal components required to get a good understanding of the data. However, there is no single threshold that answers this question. Arguably, the best way to decide is to visualize the data in a scree plot, which we will demonstrate later. It is simply a plot of the PVE of each component (often paired with the cumulative PVE). Similar to how we select the optimal tuning parameters of other learning techniques, we examine the scree plot to see where the percent of variance explained drops off, such that additional principal components don't add a significant amount of variance. This point is sometimes referred to as the elbow. We can combine this technique with some understanding of the data: if we can verify that a component is contributing some specific piece of information, we may be more willing to include more principal components.
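There is no built-in rule for locating the elbow, but a rough sketch of the cutoff idea (using the USArrests data analysed below, and an arbitrary 90% threshold that is not a recommendation) could look like this:

pr.tmp  <- prcomp(USArrests, scale = TRUE)      # PCA on the standardized data
pve.tmp <- pr.tmp$sdev^2 / sum(pr.tmp$sdev^2)   # proportion of variance explained
which(cumsum(pve.tmp) >= 0.90)[1]               # smallest number of PCs reaching 90%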
Remember that most statistical approaches can be adapted to use principal components as predictors, which can sometimes lead to less noisy results.
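As a hypothetical sketch of that idea (the simulated data and the choice of two components are arbitrary, just to show the mechanics), the score matrix returned by prcomp() can be fed straight into lm():

set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)      # 100 simulated observations on 5 features
y <- drop(x %*% rnorm(5) + rnorm(100))   # a toy response
pcs <- prcomp(x, scale = TRUE)           # principal components of the predictors
fit <- lm(y ~ pcs$x[, 1:2])              # regress on the first two score vectors
summary(fit)$r.squared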
We will be performing PCA on the USArrests dataset, which is part of base R. The rows of the dataset correspond to the 50 US states.
states <- rownames(USArrests)
states
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
The columns of the dataset contain four variables.
names(USArrests)
## [1] "Murder" "Assault" "UrbanPop" "Rape"
Let’s explore the data a little.
kable(summary(USArrests))
| Murder | Assault | UrbanPop | Rape |
|---|---|---|---|
| Min. : 0.80 | Min. : 45 | Min. :32.0 | Min. : 7.3 |
| 1st Qu.: 4.08 | 1st Qu.:109 | 1st Qu.:54.5 | 1st Qu.:15.1 |
| Median : 7.25 | Median :159 | Median :66.0 | Median :20.1 |
| Mean : 7.79 | Mean :171 | Mean :65.5 | Mean :21.2 |
| 3rd Qu.:11.25 | 3rd Qu.:249 | 3rd Qu.:77.8 | 3rd Qu.:26.2 |
| Max. :17.40 | Max. :337 | Max. :91.0 | Max. :46.0 |
We can see the data have very different means and variances. Further, the variables are measured on totally different scales. For example, UrbanPop is measured as a percentage, while the number of rapes is measured per 100,000 individuals. If we didn't standardize the data, we would be in trouble.
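We can confirm this directly from the column means and variances:

apply(USArrests, 2, mean)   # the means differ by an order of magnitude
apply(USArrests, 2, var)    # Assault's variance dwarfs the others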
Perform PCA using the prcomp() function. The rotation matrix provides the principal component loadings.
pr.out <- prcomp(USArrests, scale=TRUE)
kable(pr.out$rotation)
| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| Murder | -0.5359 | 0.4182 | -0.3412 | 0.6492 |
| Assault | -0.5832 | 0.1880 | -0.2681 | -0.7434 |
| UrbanPop | -0.2782 | -0.8728 | -0.3780 | 0.1339 |
| Rape | -0.5434 | -0.1673 | 0.8178 | 0.0890 |
We see that there are four distinct components, and we can already start to identify what each one represents. The first component places roughly similar weight on the three crime variables and much less on urban population, so it largely summarizes overall crime; it makes intuitive sense that this comes first, since the crime rates are strongly correlated and account for the largest share of the variance. The second component loads heavily on UrbanPop and so captures the effect of the urban environment, while the third component mainly contrasts rape with the other variables and the fourth contrasts murder with assault.
We can plot the first two principal components.
The Biplot
biplot(pr.out, scale=0)
Here we can see a lot of information. Start by looking at the axes: PC1 on the x-axis and PC2 on the y-axis. The arrows show how each variable loads on the two dimensions. We see that the three crime variables point mostly along the first PC, while urban population points mostly along the second PC. The state names plotted in black show how each state varies across the PC directions. For example, California has both high crime and one of the highest urban populations.
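We can sanity-check that reading by pulling California's scores directly (the signs depend on the arbitrary orientation prcomp happens to choose, so only the magnitudes matter here):

round(pr.out$x["California", 1:2], 3)   # large scores on both PC1 and PC2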
The $sdev attribute outputs the standard deviation of each component. The variance explained by each component can be computed by squaring these values:
pr.var <- pr.out$sdev^2
pr.var
## [1] 2.4802 0.9898 0.3566 0.1734
Then, to compute the proportion of variance explained by each component, we simply divide each variance by the total variance.
pve <- pr.var / sum(pr.var)
pve
## [1] 0.62006 0.24744 0.08914 0.04336
Here we see the first PC explains about 62% of the variance, and the second PC explains roughly 25%. We can also plot this information.
The Scree Plot
par(mfrow=c(1,2))
plot(pve, xlab='Principal Component',
ylab='Proportion of Variance Explained',
ylim=c(0,1),
type='b')
plot(cumsum(pve), xlab='Principal Component',
ylab='Cumulative Proportion of Variance Explained',
ylim=c(0,1),
type='b')
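As a shortcut, summary() on the prcomp object reports the same standard deviations, proportions of variance and cumulative proportions, and base R's screeplot() gives a quick plot of the component variances:

summary(pr.out)
screeplot(pr.out, type = 'lines')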