Principal Component Analysis


PCA is a transformation of your data: it attempts to find the directions (combinations of features) that explain the most variance. Let’s start with the usual imports:

> import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns

The Data

We’ll use the built-in breast cancer dataset from Scikit Learn.

> from sklearn.datasets import load_breast_cancer
+ cancer = load_breast_cancer()

The dataset is presented in dictionary form:

> cancer.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
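
The DESCR entry, for example, holds a full plain-text description of every feature; printing it is an optional but useful first step (its lengthy output is not reproduced here):

> print(cancer['DESCR'])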

We can grab the arrays and metadata out of this dictionary to set up our DataFrame and get a feel for the features:

Set up DataFrame

> df = pd.DataFrame(cancer['data'],
+                   columns=cancer['feature_names'])
> df.head()
(first five rows; the 30 columns are shown in three blocks of ten)

mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  mean fractal dimension
17.99  10.38  122.80  1001.0  0.11840  0.27760  0.3001  0.14710  0.2419  0.07871
20.57  17.77  132.90  1326.0  0.08474  0.07864  0.0869  0.07017  0.1812  0.05667
19.69  21.25  130.00  1203.0  0.10960  0.15990  0.1974  0.12790  0.2069  0.05999
11.42  20.38  77.58  386.1  0.14250  0.28390  0.2414  0.10520  0.2597  0.09744
20.29  14.34  135.10  1297.0  0.10030  0.13280  0.1980  0.10430  0.1809  0.05883

radius error  texture error  perimeter error  area error  smoothness error  compactness error  concavity error  concave points error  symmetry error  fractal dimension error
1.0950  0.9053  8.589  153.40  0.006399  0.04904  0.05373  0.01587  0.03003  0.006193
0.5435  0.7339  3.398  74.08  0.005225  0.01308  0.01860  0.01340  0.01389  0.003532
0.7456  0.7869  4.585  94.03  0.006150  0.04006  0.03832  0.02058  0.02250  0.004571
0.4956  1.1560  3.445  27.23  0.009110  0.07458  0.05661  0.01867  0.05963  0.009208
0.7572  0.7813  5.438  94.44  0.011490  0.02461  0.05688  0.01885  0.01756  0.005115

worst radius  worst texture  worst perimeter  worst area  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension
25.38  17.33  184.60  2019.0  0.1622  0.6656  0.7119  0.2654  0.4601  0.11890
24.99  23.41  158.80  1956.0  0.1238  0.1866  0.2416  0.1860  0.2750  0.08902
23.57  25.53  152.50  1709.0  0.1444  0.4245  0.4504  0.2430  0.3613  0.08758
14.91  26.50  98.87  567.7  0.2098  0.8663  0.6869  0.2575  0.6638  0.17300
22.54  16.67  152.20  1575.0  0.1374  0.2050  0.4000  0.1625  0.2364  0.07678
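
A quick check of the shape confirms what we are working with: 569 samples and 30 numeric features (the same shape appears again below once the data is scaled):

> df.shape
(569, 30)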

PCA Visualization

It is difficult to visualize high-dimensional data, but we can use PCA to find the first two principal components and visualize the data in this new two-dimensional space with a single scatter plot. Before we do this, though, we’ll need to scale our data so that each feature has unit variance.

> from sklearn.preprocessing import StandardScaler
> scaler = StandardScaler()
+ scaler.fit(df)
StandardScaler(copy=True, with_mean=True, with_std=True)
> scaled_data = scaler.transform(df)
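
As a quick optional sanity check, every column of scaled_data should now have a mean of roughly zero and a standard deviation of one:

> scaled_data.mean(axis=0).round(2)   # ~0.0 for every feature
> scaled_data.std(axis=0).round(2)    # 1.0 for every feature (StandardScaler divides by the population std)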

PCA with Scikit Learn follows a process very similar to the other pre-processing steps: we instantiate a PCA object, find the principal components using the fit method, then apply the rotation and dimensionality reduction by calling transform().

We can also specify how many components we want to keep when creating the PCA object.

> from sklearn.decomposition import PCA
> pca = PCA(n_components=2)
> pca.fit(scaled_data)
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

Now we can transform this data to its first 2 principal components.

> x_pca = pca.transform(scaled_data)
> scaled_data.shape
(569, 30)
> x_pca.shape
(569, 2)
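
As a side note, the fit and transform steps can be combined into a single call, and the fitted object can also map the two components back into the original 30-dimensional feature space (a lossy reconstruction). The variable names x_pca_alt and x_approx below are purely illustrative:

> x_pca_alt = PCA(n_components=2).fit_transform(scaled_data)   # fit and transform in one step
> x_approx = pca.inverse_transform(x_pca)                      # map the 2 components back to 30 features
> x_approx.shape
(569, 30)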

We’ve reduced 30 dimensions to just 2. Let’s plot these two dimensions out.

> sns.set_style('darkgrid')
+ plt.figure(figsize=(8,6))
+ plt.scatter(x_pca[:,0],x_pca[:,1],
+             c=cancer['target'],cmap='plasma');
+ plt.xlabel('First Principal Component');
+ plt.ylabel('Second Principal Component');
+ plt.show()

Clearly, by using just these first two components we can easily separate the two classes.
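
The scatter plot shows the separation qualitatively; to quantify how much of the total variance the two retained components capture, the fitted PCA object exposes explained_variance_ratio_ (output not reproduced here):

> pca.explained_variance_ratio_        # fraction of the total variance captured by each component
> pca.explained_variance_ratio_.sum()  # total fraction retained by the two-component projection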

Interpreting the components

Unfortunately, with this great power of dimensionality reduction comes a cost: it is no longer easy to understand what these components represent.

The components correspond to linear combinations of the original features; the components themselves are stored as an attribute of the fitted PCA object:

> pca.components_
array([[ 0.21890244,  0.10372458,  0.22753729,  0.22099499,  0.14258969,
         0.23928535,  0.25840048,  0.26085376,  0.13816696,  0.06436335,
         0.20597878,  0.01742803,  0.21132592,  0.20286964,  0.01453145,
         0.17039345,  0.15358979,  0.1834174 ,  0.04249842,  0.10256832,
         0.22799663,  0.10446933,  0.23663968,  0.22487053,  0.12795256,
         0.21009588,  0.22876753,  0.25088597,  0.12290456,  0.13178394],
       [-0.23385713, -0.05970609, -0.21518136, -0.23107671,  0.18611302,
         0.15189161,  0.06016536, -0.0347675 ,  0.19034877,  0.36657547,
        -0.10555215,  0.08997968, -0.08945723, -0.15229263,  0.20443045,
         0.2327159 ,  0.19720728,  0.13032156,  0.183848  ,  0.28009203,
        -0.21986638, -0.0454673 , -0.19987843, -0.21935186,  0.17230435,
         0.14359317,  0.09796411, -0.00825724,  0.14188335,  0.27533947]])

In this NumPy array, each row represents a principal component, and each column relates back to one of the original features. We can visualize this relationship with a heatmap:

> df_comp = pd.DataFrame(pca.components_,
+                        columns=cancer['feature_names'])
> plt.figure(figsize=(12,6))
+ sns.heatmap(df_comp,cmap='plasma');
+ plt.show()

This heatmap and its color bar show how strongly each of the original features contributes to each principal component.
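
If you want a more direct readout than the heatmap, one simple approach is to sort the absolute loadings in a single row of df_comp to see which original features dominate that component:

> df_comp.iloc[0].abs().sort_values(ascending=False).head()   # features weighted most heavily in the first principal component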