In this article I want to explain how a Principal Component Analysis (PCA) works by implementing it in Python step by step. At the end, we will compare the results to the more convenient PCA() class from the matplotlib.mlab library.
Introduction

The main purposes of a principal component analysis are the analysis of data to identify patterns, and the reduction of the dimensionality of the dataset with minimal loss of information. Here, our desired outcome of the principal component analysis is to project a feature space (our dataset, consisting of n d-dimensional samples) onto a smaller subspace that represents our data "well". A possible application would be a pattern classification task, where we want to reduce the computational costs and the error of parameter estimation by reducing the number of dimensions of our feature space, extracting a subspace that describes our data "best".

Principal Component Analysis (PCA) vs. Multiple Discriminant Analysis (MDA)

Both Multiple Discriminant Analysis (MDA) and Principal Component Analysis (PCA) are linear transformation methods that are closely related to each other. In PCA, we are interested in finding the directions (components) that maximize the variance in our dataset, whereas in MDA, we are additionally interested in finding the directions that maximize the separation (or discrimination) between different classes (for example, in pattern classification problems where our dataset consists of multiple classes). In contrast, PCA ignores the class labels.

What is a "good" subspace?

Let's assume that our goal is to reduce the dimensions of a d-dimensional dataset by projecting it onto a k-dimensional subspace (where k < d).
So, how do we know what size we should choose for k, and how do we know whether we have a feature subspace that represents our data "well"?

Summarizing the PCA approach

Listed below are the 6 general steps for performing a principal component analysis, which we will investigate in the following sections:

1. Take the whole dataset, ignoring the class labels.
2. Compute the d-dimensional mean vector.
3. Compute the scatter matrix (alternatively, the covariance matrix) of the whole dataset.
4. Compute the eigenvectors and corresponding eigenvalues.
5. Sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the largest eigenvalues to form a d x k-dimensional matrix W.
6. Use this d x k eigenvector matrix to transform the samples onto the new subspace.
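Before walking through these steps in detail, the whole pipeline can be condensed into a short sketch. This is my own summary rather than code from the sections below; the function name `pca_sketch` and the random test data are illustrative only:

```python
import numpy as np

def pca_sketch(X, k):
    """Project the d x n data matrix X onto its top-k principal components.

    A minimal sketch of the six steps above; columns are samples,
    matching the data layout used throughout this article.
    """
    mean_vec = X.mean(axis=1, keepdims=True)       # step 2: d-dimensional mean
    centered = X - mean_vec
    scatter = centered.dot(centered.T)             # step 3: scatter matrix
    eig_vals, eig_vecs = np.linalg.eig(scatter)    # step 4: eigendecomposition
    order = np.argsort(eig_vals)[::-1]             # step 5: sort by eigenvalue
    W = eig_vecs[:, order[:k]]                     # d x k projection matrix
    return W.T.dot(X)                              # step 6: transform samples

np.random.seed(0)
X = np.random.randn(3, 40)      # illustrative random data, 40 3-dimensional samples
Y = pca_sketch(X, 2)
assert Y.shape == (2, 40)
```

By construction, the first row of the result captures at least as much variance as the second, since the eigenvectors are sorted by decreasing eigenvalue.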
Generating some 3-dimensional sample data

For the following example, we will generate 40 3-dimensional samples randomly drawn from a multivariate Gaussian distribution.

Why are we choosing a 3-dimensional sample?

The problem with multi-dimensional data is its visualization, which would make it quite tough to follow our example principal component analysis (at least visually). We could also choose a 2-dimensional sample dataset for the following examples, but since the goal of PCA in a "dimensionality reduction" application is to drop at least one of the dimensions, I find it more intuitive and visually appealing to start with a 3-dimensional dataset that we reduce to a 2-dimensional dataset by dropping 1 dimension.

```python
import numpy as np

np.random.seed(234234782384239784)  # random seed for consistency
# (note: recent NumPy versions require seeds below 2**32, so you may
#  need to substitute a smaller seed there)

mu_vec1 = np.array([0, 0, 0])
cov_mat1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
class1_sample = np.random.multivariate_normal(mu_vec1, cov_mat1, 20).T
assert class1_sample.shape == (3, 20), "The matrix has not the dimensions 3x20"

mu_vec2 = np.array([1, 1, 1])
cov_mat2 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
class2_sample = np.random.multivariate_normal(mu_vec2, cov_mat2, 20).T
# fixed: the original asserted class1_sample here a second time
assert class2_sample.shape == (3, 20), "The matrix has not the dimensions 3x20"
```

Using the code above, we created two 3x20 datasets, one for each class ω1 and ω2. Just to get a rough idea of how the samples of our two classes ω1 and ω2 are distributed, let us plot them in a 3D scatter plot.
```python
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d import proj3d

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
# fixed: the original used mpl.rcParams without importing matplotlib as mpl
plt.rcParams['legend.fontsize'] = 10
ax.plot(class1_sample[0,:], class1_sample[1,:], class1_sample[2,:],
        'o', markersize=8, color='blue', alpha=0.5, label='class1')
ax.plot(class2_sample[0,:], class2_sample[1,:], class2_sample[2,:],
        '^', markersize=8, alpha=0.5, color='red', label='class2')
plt.title('Samples for class 1 and class 2')
ax.legend(loc='upper right')
plt.show()
```

1. Taking the whole dataset ignoring the class labels

Because we don't need class labels for the PCA analysis, let us merge the samples of our 2 classes into one 3x40-dimensional array.

```python
all_samples = np.concatenate((class1_sample, class2_sample), axis=1)
assert all_samples.shape == (3, 40), "The matrix has not the dimensions 3x40"
```

2. Computing the d-dimensional mean vector

```python
mean_x = np.mean(all_samples[0,:])
mean_y = np.mean(all_samples[1,:])
mean_z = np.mean(all_samples[2,:])

mean_vector = np.array([[mean_x], [mean_y], [mean_z]])
print('Mean Vector:\n', mean_vector)
```

```
Mean Vector:
 [[ 0.50576644]
 [ 0.30186591]
 [ 0.76459177]]
```

3. a) Computing the scatter matrix

```python
scatter_matrix = np.zeros((3, 3))
for i in range(all_samples.shape[1]):
    diff = all_samples[:, i].reshape(3, 1) - mean_vector
    scatter_matrix += diff.dot(diff.T)
print('Scatter Matrix:\n', scatter_matrix)
```

```
Scatter Matrix:
 [[ 48.91593255   7.11744916   7.20810281]
 [  7.11744916  37.92902984   2.7370493 ]
 [  7.20810281   2.7370493   35.6363759 ]]
```
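The scatter-matrix loop above accumulates one outer product per sample. As a side note (a sketch of my own, not code from the article, using illustrative random data), the same matrix can be computed in a single vectorized expression by centering the data first:

```python
import numpy as np

np.random.seed(3)
samples = np.random.randn(3, 40)                   # stand-in for all_samples
mean_vector = samples.mean(axis=1, keepdims=True)

# loop version, as in the article
scatter_loop = np.zeros((3, 3))
for i in range(samples.shape[1]):
    diff = samples[:, i].reshape(3, 1) - mean_vector
    scatter_loop += diff.dot(diff.T)

# vectorized equivalent: center the data, then take one matrix product
centered = samples - mean_vector
scatter_vec = centered.dot(centered.T)

np.testing.assert_array_almost_equal(scatter_loop, scatter_vec)
```

Both routes produce the same 3x3 scatter matrix; the vectorized form is simply faster and shorter.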
b) Computing the covariance matrix (alternatively to the scatter matrix)

Alternatively, instead of calculating the scatter matrix, we could also calculate the covariance matrix using the in-built numpy.cov() function:

```python
cov_mat = np.cov([all_samples[0,:], all_samples[1,:], all_samples[2,:]])
print('Covariance Matrix:\n', cov_mat)
```

```
Covariance Matrix:
 [[ 1.25425468  0.1824987   0.18482315]
 [ 0.1824987   0.97253923  0.07018075]
 [ 0.18482315  0.07018075  0.91375323]]
```

4. Computing eigenvectors and corresponding eigenvalues

To show that the eigenvectors are indeed identical whether we derive them from the scatter or the covariance matrix, let us put an equality check into the code:

```python
# eigenvectors and eigenvalues from the scatter matrix
eig_val_sc, eig_vec_sc = np.linalg.eig(scatter_matrix)

# eigenvectors and eigenvalues from the covariance matrix
eig_val_cov, eig_vec_cov = np.linalg.eig(cov_mat)

for i in range(len(eig_val_sc)):
    eigvec_sc = eig_vec_sc[:,i].reshape(1,3).T
    eigvec_cov = eig_vec_cov[:,i].reshape(1,3).T
    # fixed: comparing the .all() results (as in the original code) does not
    # test element-wise equality of the two vectors
    np.testing.assert_array_almost_equal(eigvec_sc, eigvec_cov,
        err_msg='Eigenvectors are not identical')

    print('Eigenvector {}: \n{}'.format(i+1, eigvec_sc))
    print('Eigenvalue {} from scatter matrix: {}'.format(i+1, eig_val_sc[i]))
    print('Eigenvalue {} from covariance matrix: {}'.format(i+1, eig_val_cov[i]))
    print('Scaling factor: ', eig_val_sc[i]/eig_val_cov[i])
    print(40 * '-')
```

```
Eigenvector 1: 
[[-0.84190486]
 [-0.39978877]
 [-0.36244329]]
Eigenvalue 1 from scatter matrix: 55.398855957302445
Eigenvalue 1 from covariance matrix: 1.4204834860846791
Scaling factor:  39.0
----------------------------------------
Eigenvector 2: 
[[-0.44565232]
 [ 0.13637858]
 [ 0.88475697]]
Eigenvalue 2 from scatter matrix: 32.42754801292286
Eigenvalue 2 from covariance matrix: 0.8314755900749456
Scaling factor:  39.0
----------------------------------------
Eigenvector 3: 
[[ 0.30428639]
 [-0.90640489]
 [ 0.29298458]]
Eigenvalue 3 from scatter matrix: 34.65493432806495
Eigenvalue 3 from covariance matrix: 0.8885880596939733
Scaling factor:  39.0
----------------------------------------
```
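The scaling factor of 39.0 is no coincidence: the scatter matrix equals (N - 1) times the sample covariance matrix, and here N = 40. The eigenvalues therefore differ by exactly that factor while the eigenvectors are identical. A quick standalone check (with freshly generated stand-in data, not the article's samples):

```python
import numpy as np

np.random.seed(1)
samples = np.random.randn(3, 40)                 # stand-in for all_samples
mean_vec = samples.mean(axis=1, keepdims=True)

centered = samples - mean_vec
scatter = centered.dot(centered.T)               # scatter matrix
cov = np.cov(samples)                            # np.cov divides by N - 1

# scatter = (N - 1) * covariance, hence the scaling factor 39 for N = 40
n = samples.shape[1]
np.testing.assert_array_almost_equal(scatter, (n - 1) * cov)
```

Since scaling a matrix by a constant scales its eigenvalues by the same constant and leaves the eigenvectors unchanged, both matrices lead to the same principal directions.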
Checking the eigenvector-eigenvalue calculation

Let us quickly check that the eigenvector-eigenvalue calculation is correct and satisfies the equation Σ v = λ v, where Σ is the scatter matrix, v an eigenvector, and λ the corresponding eigenvalue:

```python
for i in range(len(eig_val_sc)):
    eigv = eig_vec_sc[:,i].reshape(1,3).T
    np.testing.assert_array_almost_equal(scatter_matrix.dot(eigv),
                                         eig_val_sc[i] * eigv,
                                         decimal=6, err_msg='', verbose=True)
```
Visualizing the eigenvectors

And before we move on to the next step, just to satisfy our own curiosity, we plot the eigenvectors centered at the sample mean.

```python
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d import proj3d
from matplotlib.patches import FancyArrowPatch


class Arrow3D(FancyArrowPatch):
    def __init__(self, xs, ys, zs, *args, **kwargs):
        FancyArrowPatch.__init__(self, (0,0), (0,0), *args, **kwargs)
        self._verts3d = xs, ys, zs

    def draw(self, renderer):
        xs3d, ys3d, zs3d = self._verts3d
        xs, ys, zs = proj3d.proj_transform(xs3d, ys3d, zs3d, renderer.M)
        self.set_positions((xs[0], ys[0]), (xs[1], ys[1]))
        FancyArrowPatch.draw(self, renderer)


fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(111, projection='3d')

ax.plot(all_samples[0,:], all_samples[1,:], all_samples[2,:],
        'o', markersize=8, color='green', alpha=0.2)
ax.plot([mean_x], [mean_y], [mean_z],
        'o', markersize=10, color='red', alpha=0.5)
for v in eig_vec_sc.T:
    a = Arrow3D([mean_x, v[0]], [mean_y, v[1]], [mean_z, v[2]],
                mutation_scale=20, lw=3, arrowstyle="-|>", color="r")
    ax.add_artist(a)
ax.set_xlabel('x_values')
ax.set_ylabel('y_values')
ax.set_zlabel('z_values')

plt.title('Eigenvectors')
plt.show()
```
5.1. Sorting the eigenvectors by decreasing eigenvalues

We started with the goal to reduce the dimensionality of our feature space, i.e., projecting the feature space via PCA onto a smaller subspace, where the eigenvectors will form the axes of this new feature subspace. However, the eigenvectors only define the directions of the new axes, since they all have the same unit length 1, which we can confirm with the following code:

```python
for ev in eig_vec_sc:
    # a numerical test instead of a plain 'assert' because of rounding errors
    np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))
```

So, in order to decide which eigenvector(s) we want to drop for our lower-dimensional subspace, we have to take a look at the corresponding eigenvalues of the eigenvectors. Roughly speaking, the eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data, and those are the ones we want to drop.

```python
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_val_sc[i]), eig_vec_sc[:,i])
             for i in range(len(eig_val_sc))]

# Sort the (eigenvalue, eigenvector) tuples from high to low;
# the key avoids comparing the array elements on eigenvalue ties
eig_pairs.sort(key=lambda pair: pair[0])
eig_pairs.reverse()

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
for i in eig_pairs:
    print(i[0])
```

```
55.3988559573
34.6549343281
32.4275480129
```
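A common heuristic for choosing k, not covered in this article, is the explained-variance ratio: divide each eigenvalue by the sum of all eigenvalues to see what fraction of the total variance each component carries. A sketch using the eigenvalues printed above (the exact percentages depend on the random sample):

```python
import numpy as np

# eigenvalues of the scatter matrix, as printed above (rounded)
eig_vals = np.array([55.399, 34.655, 32.428])

explained = eig_vals / eig_vals.sum()   # fraction of total variance per component
# the top two components retain about 73.5% of the total variance here,
# so dropping the third dimension loses roughly a quarter of the variance
top2 = explained[:2].sum()
```

If a target such as "keep 90% of the variance" were required, we would need all three components for this particular sample, which illustrates that k is always a trade-off between compression and information loss.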
5.2. Choosing k eigenvectors with the largest eigenvalues

For our simple example, where we are reducing a 3-dimensional feature space to a 2-dimensional feature subspace, we are combining the two eigenvectors with the highest eigenvalues to construct our d x k-dimensional eigenvector matrix W.

```python
matrix_w = np.hstack((eig_pairs[0][1].reshape(3,1),
                      eig_pairs[1][1].reshape(3,1)))
print('Matrix W:\n', matrix_w)
```

```
Matrix W:
 [[-0.84190486  0.30428639]
 [-0.39978877 -0.90640489]
 [-0.36244329  0.29298458]]
```
6. Transforming the samples onto the new subspace

In the last step, we use the 3x2-dimensional matrix W that we just computed to transform our samples onto the new subspace via the equation y = W^T x:

```python
transformed = matrix_w.T.dot(all_samples)
assert transformed.shape == (2, 40), "The matrix is not 2x40 dimensional."

plt.plot(transformed[0,0:20], transformed[1,0:20],
         'o', markersize=7, color='blue', alpha=0.5, label='class1')
plt.plot(transformed[0,20:40], transformed[1,20:40],
         '^', markersize=7, color='red', alpha=0.5, label='class2')
plt.xlim([-4,4])
plt.ylim([-4,4])
plt.xlabel('x_values')
plt.ylabel('y_values')
plt.legend()
plt.title('Transformed samples with class labels')
plt.show()
```

Using the PCA() class from the matplotlib.mlab library

Now that we have seen how a principal component analysis works, we can use the in-built PCA() class from the matplotlib.mlab library for convenience.
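A caveat: the PCA() class was deprecated in matplotlib 2.2 and removed in matplotlib 3.1, so on recent versions a different route is needed. As a sketch of my own (with illustrative stand-in data, not the article's samples), the same projection directions can also be obtained from the singular value decomposition of the centered data, which is how most modern PCA implementations work:

```python
import numpy as np

np.random.seed(2)
X = np.random.randn(3, 40)                  # stand-in for all_samples
Xc = X - X.mean(axis=1, keepdims=True)      # center the data

# eigendecomposition route, as used in this article
eig_vals, eig_vecs = np.linalg.eig(Xc.dot(Xc.T))
order = np.argsort(eig_vals)[::-1]
W_eig = eig_vecs[:, order[:2]]

# SVD route: the left-singular vectors of the centered data
# span the same principal directions
U, S, Vt = np.linalg.svd(Xc)
W_svd = U[:, :2]

# each pair of unit vectors agrees up to sign, so their
# absolute dot product is (numerically) 1
for i in range(2):
    assert abs(W_eig[:, i].dot(W_svd[:, i])) > 0.999
```

The sign ambiguity is inherent to PCA: if v is an eigenvector, so is -v, so different implementations may flip individual components without changing the subspace.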