Principal Component Analysis


PCA is a transformation of your data: it attempts to find the directions (combinations of features) that explain the most variance. Let’s start with the usual imports:

> import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns

The Data

We’ll use the built-in breast cancer dataset from Scikit Learn.

> from sklearn.datasets import load_breast_cancer
+ cancer = load_breast_cancer()

The dataset is presented in dictionary form:

> cancer.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
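
The DESCR entry, for example, holds a full plain-text description of every feature; printing it is an optional but useful first step (its lengthy output is not reproduced here):

> print(cancer['DESCR'])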

We can grab the arrays and metadata out of this dictionary to set up our DataFrame and get a feel for the features:

Set up DataFrame

> df = pd.DataFrame(cancer['data'],
+                   columns=cancer['feature_names'])
> df.head()
(first five rows; the 30 columns are shown in three blocks of ten)

mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness  mean concavity  mean concave points  mean symmetry  mean fractal dimension
17.99  10.38  122.80  1001.0  0.11840  0.27760  0.3001  0.14710  0.2419  0.07871
20.57  17.77  132.90  1326.0  0.08474  0.07864  0.0869  0.07017  0.1812  0.05667
19.69  21.25  130.00  1203.0  0.10960  0.15990  0.1974  0.12790  0.2069  0.05999
11.42  20.38  77.58  386.1  0.14250  0.28390  0.2414  0.10520  0.2597  0.09744
20.29  14.34  135.10  1297.0  0.10030  0.13280  0.1980  0.10430  0.1809  0.05883

radius error  texture error  perimeter error  area error  smoothness error  compactness error  concavity error  concave points error  symmetry error  fractal dimension error
1.0950  0.9053  8.589  153.40  0.006399  0.04904  0.05373  0.01587  0.03003  0.006193
0.5435  0.7339  3.398  74.08  0.005225  0.01308  0.01860  0.01340  0.01389  0.003532
0.7456  0.7869  4.585  94.03  0.006150  0.04006  0.03832  0.02058  0.02250  0.004571
0.4956  1.1560  3.445  27.23  0.009110  0.07458  0.05661  0.01867  0.05963  0.009208
0.7572  0.7813  5.438  94.44  0.011490  0.02461  0.05688  0.01885  0.01756  0.005115

worst radius  worst texture  worst perimeter  worst area  worst smoothness  worst compactness  worst concavity  worst concave points  worst symmetry  worst fractal dimension
25.38  17.33  184.60  2019.0  0.1622  0.6656  0.7119  0.2654  0.4601  0.11890
24.99  23.41  158.80  1956.0  0.1238  0.1866  0.2416  0.1860  0.2750  0.08902
23.57  25.53  152.50  1709.0  0.1444  0.4245  0.4504  0.2430  0.3613  0.08758
14.91  26.50  98.87  567.7  0.2098  0.8663  0.6869  0.2575  0.6638  0.17300
22.54  16.67  152.20  1575.0  0.1374  0.2050  0.4000  0.1625  0.2364  0.07678
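
A quick check of the shape confirms what we are working with: 569 samples and 30 numeric features (the same shape appears again below once the data is scaled):

> df.shape
(569, 30)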

PCA Visualization

It is difficult to visualize high-dimensional data, but we can use PCA to find the first two principal components and visualize the data in this new two-dimensional space with a single scatter plot. Before we do this, though, we’ll need to scale our data so that each feature has unit variance.

> from sklearn.preprocessing import StandardScaler
> scaler = StandardScaler()
+ scaler.fit(df)
StandardScaler(copy=True, with_mean=True, with_std=True)
> scaled_data = scaler.transform(df)
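
As a quick optional sanity check, every column of scaled_data should now have a mean of roughly zero and a standard deviation of one:

> scaled_data.mean(axis=0).round(2)   # ~0.0 for every feature
> scaled_data.std(axis=0).round(2)    # 1.0 for every feature (StandardScaler divides by the population std)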

PCA with Scikit Learn follows a process very similar to the other pre-processing steps: we instantiate a PCA object, find the principal components using the fit method, then apply the rotation and dimensionality reduction by calling transform().

We can also specify how many components we want to keep when creating the PCA object.

> from sklearn.decomposition import PCA
> pca = PCA(n_components=2)
> pca.fit(scaled_data)
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

Now we can transform this data to its first 2 principal components.

> x_pca = pca.transform(scaled_data)
> scaled_data.shape
(569, 30)
> x_pca.shape
(569, 2)
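
As a side note, the fit and transform steps can be combined into a single call, and the fitted object can also map the two components back into the original 30-dimensional feature space (a lossy reconstruction). The variable names x_pca_alt and x_approx below are purely illustrative:

> x_pca_alt = PCA(n_components=2).fit_transform(scaled_data)   # fit and transform in one step
> x_approx = pca.inverse_transform(x_pca)                      # map the 2 components back to 30 features
> x_approx.shape
(569, 30)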

We’ve reduced 30 dimensions to just 2. Let’s plot these two dimensions out.

> sns.set_style('darkgrid')
+ plt.figure(figsize=(8,6))
+ plt.scatter(x_pca[:,0],x_pca[:,1],
+             c=cancer['target'],cmap='plasma');
+ plt.xlabel('First Principal Component');
+ plt.ylabel('Second Principal Component');
+ plt.show()

Clearly, by using just these first two components we can easily separate the two classes.
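
The scatter plot shows the separation qualitatively; to quantify how much of the total variance the two retained components capture, the fitted PCA object exposes explained_variance_ratio_ (output not reproduced here):

> pca.explained_variance_ratio_        # fraction of the total variance captured by each component
> pca.explained_variance_ratio_.sum()  # total fraction retained by the two-component projection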

Interpreting the components

Unfortunately, with this great power of dimensionality reduction comes a cost: it is no longer easy to understand what these components represent.

The components correspond to linear combinations of the original features; the components themselves are stored as an attribute of the fitted PCA object:

> pca.components_
array([[ 0.21890244,  0.10372458,  0.22753729,  0.22099499,  0.14258969,
         0.23928535,  0.25840048,  0.26085376,  0.13816696,  0.06436335,
         0.20597878,  0.01742803,  0.21132592,  0.20286964,  0.01453145,
         0.17039345,  0.15358979,  0.1834174 ,  0.04249842,  0.10256832,
         0.22799663,  0.10446933,  0.23663968,  0.22487053,  0.12795256,
         0.21009588,  0.22876753,  0.25088597,  0.12290456,  0.13178394],
       [-0.23385713, -0.05970609, -0.21518136, -0.23107671,  0.18611302,
         0.15189161,  0.06016536, -0.0347675 ,  0.19034877,  0.36657547,
        -0.10555215,  0.08997968, -0.08945723, -0.15229263,  0.20443045,
         0.2327159 ,  0.19720728,  0.13032156,  0.183848  ,  0.28009203,
        -0.21986638, -0.0454673 , -0.19987843, -0.21935186,  0.17230435,
         0.14359317,  0.09796411, -0.00825724,  0.14188335,  0.27533947]])

In this NumPy array, each row represents a principal component, and each column relates back to one of the original features. We can visualize this relationship with a heatmap:

> df_comp = pd.DataFrame(pca.components_,
+                        columns=cancer['feature_names'])
> plt.figure(figsize=(12,6))
+ sns.heatmap(df_comp,cmap='plasma');
+ plt.show()

This heatmap and its color bar show how strongly each of the original features contributes to each principal component.
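
If you want a more direct readout than the heatmap, one simple approach is to sort the absolute loadings in a single row of df_comp to see which original features dominate that component:

> df_comp.iloc[0].abs().sort_values(ascending=False).head()   # features weighted most heavily in the first principal component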