Some follow-ups on PCA.

Scree Plots.

As I said last time, I think it’s important that the method works out of sample. Almost any data will turn up arcs, and this method is effectively a way of searching for them. But scree plots let us see whether those arcs are actually capturing most of the variance in the data.

Now I am readjusting the scree plots to ask what percentage of the still-unexplained variance each successive component explains. The first component (linear motion) explains about 80% of the variation. That’s significantly higher than even the outliers among the 1,000 random runs.

Even more interestingly, the second component (distinguishing beginnings from ends) explains more than 80% of the variance still left. Compared to the random-walk distribution, that’s even more remarkable.
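To make the "share of what is left" idea concrete, here is a minimal sketch with made-up variance shares for four components. The numbers are purely illustrative and are not taken from the real data.

```r
# Made-up shares of total variance for four components (illustration only).
shares <- c(0.80, 0.16, 0.03, 0.01)

# Variance still unexplained just before each component is considered.
remainingBefore <- 1 - c(0, cumsum(shares))[seq_along(shares)]

# Share of that remaining variance captured by each component.
conditional <- shares / remainingBefore
print(round(conditional, 2))
```

Here the second component accounts for only 16% of the total variance, but for 80% of what the first component left behind, which is the quantity the readjusted scree plot shows.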

In both the real and the random data, because I’m overfitting, the sixth component explains nothing, and the fifth component explains everything remaining. The real data looks like it’s over 1.0, but that’s actually a division artifact: so little variance is left by that point that the ratio is numerically unreliable.
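The division problem is easy to reproduce. The sketch below assumes an idealized case in which five components exhaust the variance exactly, so the sixth ratio is 0/0. The `safeShare` helper and its `tol` threshold are hypothetical names introduced here for illustration; they are not part of the analysis in this post.

```r
# Hypothetical shares: five components use up all the variance, the sixth adds nothing.
shares <- c(0.5, 0.25, 0.125, 0.0625, 0.0625, 0)

# Naive ratio: the last entry divides zero by zero and comes out as NaN.
raw <- shares / (1 - c(0, cumsum(shares))[seq_along(shares)])
print(raw)

# Guarded version: report NA whenever the remaining variance falls below `tol`.
safeShare <- function(shares, tol = 1e-8) {
  remaining <- 1 - c(0, cumsum(shares))[seq_along(shares)]
  ratio <- shares / remaining
  ratio[remaining < tol] <- NA
  ratio
}
print(safeShare(shares))
```

With real data the remainder is rarely exactly zero; floating-point rounding can leave a tiny positive or negative value instead, which is one way a ratio can land slightly above 1.0 rather than coming out as NaN.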

So, as before, I run 1,000 random samples and see how their scree plots compare to this one.

# Proportion of total variance explained by each principal component.
variance = function(p) {p$sdev^2 / sum(p$sdev^2)}
# Despite the name, this returns each component's share of the variance
# left unexplained by the components before it.
cumVariance = function(p) {
  var = variance(p)
  # The amount explained starts at zero, and is the sum of previous values.
  explainedSoFar = c(0,cumsum(var))
  # The last entry will always be one, because it tells us how much is explained after everything;
  # drop it.
  explainedSoFar = explainedSoFar[-length(explainedSoFar)]
  # How much of the remaining variation does this one explain?
  explained = var/(1-explainedSoFar)
  explained
}

library(dplyr)
library(ggplot2)

random = as.data.frame(t(replicate(1000,cumVariance(randomData()$PCA)))) %>%
  tidyr::gather(component,variance,V1:V6,convert = TRUE) %>%
  mutate(component = as.numeric(gsub("V","",component)))

ggplot(random,aes(x=component,y=variance,group=component)) +
  geom_boxplot() +
  geom_point(data=data.frame(variance = cumVariance(masterPCAArcs),component=1:6),color="red",size=4) +
  labs(title="The first two PCs explain much more of the unexplained variance\n(red) than 1000 random samples (boxplots)")
## Warning: Removed 781 rows containing non-finite values (stat_boxplot).

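[Plot: share of the remaining variance explained by each component, with the real data in red and the 1,000 random samples as boxplots]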
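Since `randomData()` is defined elsewhere, here is a rough self-contained stand-in for trying the comparison without it. It assumes the random samples are independent Gaussian random walks in six dimensions; `randomWalkPCA`, `remainingShare`, and the sizes used are hypothetical choices for this sketch, not the actual setup behind the plot.

```r
set.seed(1)

# Hypothetical stand-in for randomData(): each column is an independent random walk.
randomWalkPCA <- function(nSteps = 50, nDims = 6) {
  steps <- matrix(rnorm(nSteps * nDims), nrow = nSteps, ncol = nDims)
  walk <- apply(steps, 2, cumsum)
  prcomp(walk)
}

# Share of the remaining variance per component, same calculation as cumVariance.
remainingShare <- function(p) {
  shares <- p$sdev^2 / sum(p$sdev^2)
  shares / (1 - c(0, cumsum(shares))[seq_along(shares)])
}

# One row per simulated sample, one column per component.
null <- t(replicate(200, remainingShare(randomWalkPCA())))

# Upper end of the null distribution for each component.
print(apply(null, 2, quantile, probs = 0.95, na.rm = TRUE))
```

An observed value well above the corresponding quantile is the numeric counterpart of a red point sitting above its boxplot.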