Frank Noe
frank.noe@fu-berlin.de
FU Berlin,
Arnimallee 6, 14195 Berlin, Germany
Summary: The time-lagged independent component analysis (TICA) defines a linear transform of some (usually high-dimensional) set of input coordinates to some (usually low-dimensional) set of output coordinates. The transform is chosen such that amongst all linear transforms, TICA maximizes the autocorrelation of transformed coordinates. In other words, TICA finds a maximally slow subspace, or a subspace of good reaction coordinates when the input coordinates come from molecular dynamics. TICA is an excellent choice to process high-dimensional input data such as to prepare it for Markov model construction. |
We consider a $d$-dimensional vector of input data, called $\mathbf{r}(t)=(r_{i}(t))_{i=1,...,D}$. Here, $t$ is an integer from $\{1...N\}$ denoting the time step. We assume that the data is mean-free, i.e. from a general input vector $\tilde{\mathbf{r}}(t)$, we first obtain: $$\mathbf{r}(t)=\tilde{\mathbf{r}}(t)-\langle\tilde{\mathbf{r}}(t)\rangle_{t} \:\:\:\:(1)$$ where $\langle\tilde{\mathbf{r}}(t)\rangle_{t}$ is the data mean.
We first compute the covariance matrices from the data: $$c_{ij}(\tau)=\langle r_{i}(t)\: r_{j}(t+\tau)\rangle_{t} \:\:\:\:(2)$$ where $\tau$ is the lag time. We will need two matrices for TICA, for the choices $\tau=0$ and another positive value of $\tau$. $\langle\cdot\rangle_{t}$ denotes the time average. We can evaluate it as follows: $$c_{ij}(\tau)=\frac{1}{N-\tau-1}\sum_{t=1}^{N-\tau}r_{i}(t)\, r_{j}(t+\tau). \:\:\:\:(3)$$ It is easy to verify that $\mathbf{C}(0)$ is a symmetric matrix. For algebraic reasons we will need that $\mathbf{C}(\tau)$ is also symmetric, which is not automatically guaranteed when estimating it from a finite data set. Therefore we enforce symmetry from a data-computed matrix $\mathbf{C}_{d}(\tau)$: $$\mathbf{C}(\tau)=\frac{1}{2}\left(\mathbf{C}_{d}(\tau)+\mathbf{C}_{d}^{\top}(\tau)\right). \:\:\:\:(4)$$ Now we solve the generalized eigenvalue problem: $$\mathbf{C}(\tau)\:\mathbf{U}=\mathbf{C}(0)\:\mathbf{U}\:\boldsymbol{\Lambda} \:\:\:\:(5)$$ where $U$ is the eigenvector-matrix containing the independent components (ICs) in the columns and $\Lambda$ is a diagonal eigenvalue matrix. This problem can be solved by an appropriate generalized eigenvalue solver (directly).
Now, the data can be projected onto the TICA space: $$\mathbf{z}^{\top}(t)=\mathbf{r}^{\top}(t)\mathbf{U} \:\:\:\:(6)$$ in this step, we can perform a dimension reduction by selecting only a sub-matrix $\mathbf{U}$ consisting of the first $m$ columns of the full-rank $\mathbf{U}$.
The generalized eigenvalue problem can be solved in two steps. The two step procedure is called AMUSE algorithm [1] and works as follows:
Solve the simple PCA Eigenvalue problem $\mathbf{C}(0)\:\mathbf{W}=\mathbf{W}\:\boldsymbol{\Sigma}$, where $\mathbf{W}$ is the eigenvector matrix with principal components and $\boldsymbol{\Sigma}$ are their variances (diagonal Eigenvalue matrix).
Transform the mean-free data $\mathbf{r}(t)$ onto principal components $\mathbf{y}(t)=\mathbf{W}\:\mathbf{r}(t)$.
Normalize the principal components: $\mathbf{y}'(t)=\boldsymbol{\Sigma}^{-1}\mathbf{y}(t)$
Compute the symmetrized time-lagged covariance matrix of the normalized PCs: $\mathbf{C}_{sym}^{y}(\tau)=\frac{1}{2}\left[\mathbf{C}^{y}(\tau)+(\mathbf{C}^{y}(\tau))^{\top}\right]$
Perform an eigenvalue decomposition of $\mathbf{C}_{sym}^{y}(\tau)$ to obtain the eigenvector matrix $\mathbf{V}$ and project the trajectory onto the dominant eigenvectors to obtain $\mathbf{z}(t)$.
In summary, we can write the transformation equation in three linear transforms: $$\mathbf{z}^{\top}(t)=\mathbf{r}^{\top}(t)\mathbf{U}=\mathbf{r}^{\top}(t)\mathbf{W}\boldsymbol{\Sigma}^{-1}\mathbf{V}. \:\:\:\:(7)$$
TICA will be used as a dimension reduction technique. Only the dominant TICA components will be used to go to the next step. In order to reduce the dimension, use only a few columns of $\mathbf{U}$.
The TICA method was first introduced in [2]. TICA was introduced into the molecular dynamics literature and as a method for Markov model estimation independently in two papers: [5][6]. The fact that TICA is an optimal approximation to the Markov operator’s eigenvalues and eigenfunction was shown in [5].
import numpy as np
import pyemma.msm as msm
import pyemma.msm.generation as msmgen
import pyemma.msm.analysis as msmana
import pyemma.coordinates as coor
import matplotlib.pylab as plt
%pylab inline
def assign(X, cc):
T = X.shape[0]
I = np.zeros((T),dtype=int)
for t in range(T):
dists = X[t] - cc
dists = dists ** 2
I[t] = np.argmin(dists)
return I
Test data
We generate a Hidden-Markov Model with two Gaussians output functions in two dimensions, in such a way that the maximum variance direction differs from the direction of maximum autocorrelation.
P = np.array([[0.99, 0.01],
[0.01, 0.99]])
T = 50000
means = [np.array([-1,1]), np.array([1,-1])]
widths = [np.array([0.3,2]),np.array([0.3,2])]
# continuous trajectory
X = np.zeros((T, 2))
# hidden trajectory
dtraj = msmgen.generate_traj(P, T)
for t in range(T):
s = dtraj[t]
X[t,0] = widths[s][0] * numpy.random.randn() + means[s][0]
X[t,1] = widths[s][1] * numpy.random.randn() + means[s][1]
Perform PCA and TICA
We first compute the PCA transform of our trajectory data, and store eigenvalues (variances) and eigenvectors (principal components):
pca = coor.pca(data = X)
pc = pca.eigenvectors
S = pca.eigenvalues
km = coor.cluster_kmeans(data = X, k=100)
Next, we compute the TICA transform and store eigenvalues (autocorrelations) and eigenvectors (independent components)
tica = coor.tica(data = X)
ic = tica.eigenvectors
L = tica.eigenvalues
Now, we plot our results. The data is shown in the form of a scatter plot, and the principal components / independent components are shown as arrows, scaled by the magnitude of the associated eigenvalue
def draw_arrow(a, v, color):
plt.arrow(0, 0, a*v[0], a*v[1], color=color, width=0.02, linewidth=3)
plt.figure(figsize=(4,7))
scatter(X[:,0], X[:,1], marker = '.', color='black')
draw_arrow(S[0], pc[:,0], color='blue')
#draw_arrow(S[1], pc[:,1], color='blue')
plt.text(0.5, 2.5, 'PCA', color='blue', fontsize = 20, fontweight='bold', rotation='vertical')
draw_arrow(4*L[0], ic[:,0], color='orange')
#draw_arrow(4*L[1], ic[:,1], color='orange')
plt.text(-1.5, 1.7, 'TICA', color='orange', fontsize = 20, fontweight='bold', rotation='vertical')
xlim(-4,4)
ylim(-7,7)
We can clearly see that the first PCA component is along the direct of maximal variance in the data, which is in this case not very informative for resolving the rare event that takes place between the two elongated Normal distributions. The second principal components is orthogonal to the first component in the real space.
TICA successfully identifies the slow reaction coordinate. The corresponding eigenvalue (autocorrelation) is nearly one as the associated process is slow. The second TICA component is orthogonal in a scaled space and thus appears nonorthogonal in the real space. The eigenvalue associated with the second TICA component is nearly zero, indicating that no second slow process exists
Projection
Now we project the input data onto the first principal or independent component, respectively:
Ypca = pca.get_output()[0][:,0]
Ytica = tica.get_output()[0][:,0]
For better visual clarity, the TICA trajectory is just scaled by a multiple. We can see that the TICA projection clearly distinguishes between the two metastable states, while the PCA projection lumps them together:
hist(Ypca, bins=50, histtype='step', linewidth=3, label='PCA', color='blue')
hist(4*Ytica, bins=50, histtype='step', linewidth=3, label='TICA', color='orange')
xlabel('essential coordinate (1st principal or indipendent component)')
ylabel('projected histogram')
legend()
1D-MSM
The advantage of a projection that preserves the slow process directions is immediately seen when building a Markov state model along the first coordinate, either for the PCA or TICA projection:
cc_pca = np.linspace(np.min(Ypca), np.max(Ypca))
I_pca = assign(Ypca[:,None], cc_pca)
cc_tica = np.linspace(np.min(Ypca), np.max(Ypca))
I_tica = assign(Ytica[:,None], cc_tica)
lags = [1,2,5,7,10,15,20]
its_pca = msm.its([I_pca], lags)
its_tica = msm.its([I_tica], lags)
plot([1,20],[msmana.timescales(P)[1],msmana.timescales(P)[1]], linewidth=3, linestyle='dashed', color='black', label='exact')
plot(its_pca.get_lagtimes(), its_pca.get_timescales()[:,0], linewidth=3, color='blue', label='PCA')
plot(its_tica.get_lagtimes(), its_tica.get_timescales()[:,0], linewidth=3, color='orange', label='TICA')
xlabel('lag time')
ylabel('relaxation timescale')
legend(loc='center')
While the 1D-Markov model of the PCA projection gives a very poor and slowly converging estimate of the relaxation timescales, the 1D-Markov model of the TICA projection recovers the true timescale accurately and for short lag times.
[1]: Aapo Hyvärinen, Juha Karhunen, Erkki Oja: Independent Component Analysis. John Wiley \& Sons. (2001).
[2]: Molgedey, L. and Schuster, H. G.: Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett. 72, 3634-3637 (1994).
[3]: F. Noé and F. Nüske: A variational approach to modeling slow processes in stochastic dynamical systems. SIAM Multiscale Model. Simul. 11, 635-655 (2013).
[4]: F. Nüske, B. Keller, G. Pérez-Hernández, A. S. J. S. Mey and F. Noé: Variational Approach to Molecular Kinetics. J. Chem. Theory Comput. 10, 1739-1752 (2014).
[5]: G. Perez-Hernandez, F. Paul, T. Giorgino, G. {de Fabritiis} and Frank Noé: Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys. 139, 015102 (2013).
[6]: C. R. Schwantes and V. S. Pande: Improvements in Markov State Model Construction Reveal Many Non-Native Interactions in the Folding of NTL9. J. Chem. Theory Comput. 9, 2000-2009 (2013).