Maximum likelihood transition matrix

FU Berlin, Arnimallee 6, 14195 Berlin, Germany

Summary: We have observed one or several trajectories amongst $n$ discrete states. At lag time $\tau$, there have been $c_{ij}=c_{ij}(\tau)$ transitions between these states, as encoded in a count matrix $\mathbf{C}\in\mathbb{R}^{n\times n}$. We aim to estimate a Markov transition matrix, $\hat{\mathbf{P}}\in\mathbb{R}^{n\times n}$. Mathematically, $$\hat{\mathbf{P}} = \arg\max_{\mathbf{P}}p(\mathbf{C}\mid\mathbf{P}).$$ It turns out that the maximum likelihood estimator is identical to the intuitive fraction of counts, $\hat{p}_{ij}=c_{ij}/\sum_{k=1}^{n}c_{ik}$.

Consider that we have observed a trajectory $[s_{1},...,s_{T}]$ amongst $n$ discrete states labeled $1$ through $n$. The numbers of transitions between any pair of states $i$ and $j$ are collected in the count matrix $\mathbf{C}^{\mathrm{obs}}=[c_{ij}^{\mathrm{obs}}]$, i.e.: $$c_{ij}^{\mathrm{obs}}=\left|\{t\in\{1,...,T-1\}\mid s_{t}=i\wedge s_{t+1}=j\}\right| \:\:\:\:(1)$$ and the total number of transitions out of state $i$ is $c_{i}^{\mathrm{obs}}$: $$c_{i}^{\mathrm{obs}}=\sum_{j=1}^{n}c_{ij}^{\mathrm{obs}}. \:\:\:\:(2)$$ The likelihood of a Markov chain with transition matrix $\mathbf{P}=[p_{ij}]\in\mathbb{R}^{n\times n}$ is given by: $$p(\mathbf{C}^{\mathrm{obs}}\mid\mathbf{P})=\prod_{i,j=1}^{n}p_{ij}^{c_{ij}^{\mathrm{obs}}}. \:\:\:\:(3)$$ Our aim is to estimate the transition matrix that maximizes (3). In order to generalize our approach, we also consider the maximization of a posterior probability which uses a conjugate prior, i.e. a prior with the same functional form as the likelihood: $$p(\mathbf{P}\mid\mathbf{C}^{\mathrm{obs}})\propto\prod_{i,j=1}^{n}p_{ij}^{c_{ij}^{\mathrm{prior}}+c_{ij}^{\mathrm{obs}}}=\prod_{i,j=1}^{n}p_{ij}^{c_{ij}}, \:\:\:\:(4)$$ where $c_{ij}^{\mathrm{prior}}$ are prior counts. We now seek the matrix that maximizes (4). For the choice $c_{ij}^{\mathrm{prior}}\equiv0$ (uniform prior), this is identical to the maximum likelihood estimator. The probability (4) is difficult to work with due to the product. For optimization purposes it is therefore a common “trick” to instead work with the logarithm of the likelihood (log-likelihood): $$Q=\log p(\mathbf{P}\mid\mathbf{C})=\sum_{i,j}c_{ij}\log p_{ij}. \:\:\:\:(5)$$ This is useful since the logarithm is a monotonic function: as a result, the maximizer of $\log f$ is also the maximizer of $f$. However, without further constraints this function is not bounded from above, since $Q\rightarrow\infty$ for $p_{ij}\rightarrow\infty$.
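The quantities defined in Eqs. (1), (2) and (5) are straightforward to compute; the following is a minimal sketch (the function names are illustrative, not taken from any particular library):

```python
import numpy as np

def count_matrix(traj, n):
    """Count matrix C = [c_ij]: number of observed transitions
    from state i to state j along the trajectory (Eq. 1)."""
    C = np.zeros((n, n), dtype=int)
    for s, t in zip(traj[:-1], traj[1:]):
        C[s, t] += 1
    return C

def log_likelihood(C, P):
    """Log-likelihood Q = sum_ij c_ij log p_ij (Eq. 5).
    Terms with c_ij = 0 contribute nothing and are skipped,
    which also avoids log(0) for unobserved transitions."""
    mask = C > 0
    return float(np.sum(C[mask] * np.log(P[mask])))
```

For example, `count_matrix([0, 0, 1, 0, 1, 1], 2)` yields the count matrix `[[1, 2], [1, 1]]`, with row sums $c_0=3$ and $c_1=2$ as in Eq. (2).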
Of course, we need to restrict ourselves to sets of variables which actually form transition matrices, i.e., which satisfy the constraint: $$\sum_{j}p_{ij}=1. \:\:\:\:(6)$$ When optimizing with equality constraints, one uses Lagrange multipliers. The Lagrangian for $Q$ is given by: $$F=Q+\lambda_{1}\left(\sum_{j}p_{1j}-1\right)+...+\lambda_{n}\left(\sum_{j}p_{nj}-1\right). \:\:\:\:(7)$$ This function is maximized by the maximum likelihood transition matrix. It turns out that $F$ only has a single stationary point, which can easily be found by setting the partial derivatives, $$\frac{\partial F}{\partial p_{ij}}=\frac{c_{ij}}{p_{ij}}+\lambda_{i}, \:\:\:\:(8)$$ to zero: $$\begin{aligned} \frac{c_{ij}}{\hat{p}_{ij}}+\lambda_{i} & =0\\ \lambda_{i}\hat{p}_{ij} & =-c_{ij}. \:\:\:\:(9)\end{aligned}$$ We now sum over $j$ on both sides and make use of the transition matrix property (6): $$\lambda_{i}\sum_{j=1}^{n}\hat{p}_{ij}=\lambda_{i}=-\sum_{j=1}^{n}c_{ij}=-c_{i}. \:\:\:\:(10)$$ Inserting the Lagrange multipliers, $\lambda_{i}=-c_{i}$, into Eq. (9) thus yields: $$\begin{aligned} \frac{c_{ij}}{\hat{p}_{ij}}-c_{i} & =0\\ \hat{p}_{ij} & =\frac{c_{ij}}{c_{i}}. \:\:\:\:(11)\end{aligned}$$ It turns out that $\hat{\mathbf{P}}(\tau)$, as provided by Eq. (11), is the maximum of $p(\mathbf{P}\mid\mathbf{C}^{\mathrm{obs}})$ and thus also of $p(\mathbf{C}^{\mathrm{obs}}\mid\mathbf{P})$ when transition matrices are assumed to be uniformly distributed a priori.
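In practice, Eq. (11) amounts to a row normalization of the count matrix. A minimal sketch, with an optional uniform pseudo-count playing the role of $c_{ij}^{\mathrm{prior}}$ (the function name is illustrative):

```python
import numpy as np

def mle_transition_matrix(C, prior_counts=0.0):
    """Maximum likelihood / maximum posterior estimate (Eq. 11):
    p_hat_ij = c_ij / c_i, where the counts c_ij may include a
    uniform pseudo-count c_ij^prior. prior_counts = 0 gives the
    plain maximum likelihood estimator."""
    C = np.asarray(C, dtype=float) + prior_counts
    return C / C.sum(axis=1, keepdims=True)
```

For the count matrix `[[1, 2], [1, 1]]` this returns `[[1/3, 2/3], [1/2, 1/2]]`; each row sums to one by construction, so the constraint (6) is automatically satisfied.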

In the limit of infinite sampling, i.e., trajectory length $T\rightarrow\infty$, $p(\mathbf{P}\mid\mathbf{C}^{\mathrm{obs}})$ converges towards a Dirac delta distribution with its peak at $\hat{\mathbf{P}}(\tau)$. Let $N=T-1$ denote the total number of observed transitions and $\pi_{i}$ the equilibrium probability of state $i$. The expected number of transitions from state $i$ to state $j$ is then $$\mathbb{E}[c_{ij}]=N\pi_{i}p_{ij} \:\:\:\:(12)$$ and the expected number of transitions out of state $i$ is $$\mathbb{E}[c_{i}]=N\pi_{i}. \:\:\:\:(13)$$ In this limit the prior contribution vanishes: $$\lim_{N\rightarrow\infty}\hat{p}_{ij}=\lim_{N\rightarrow\infty}\frac{c_{ij}}{c_{i}}=\lim_{N\rightarrow\infty}\frac{c_{ij}^{\mathrm{prior}}+N\pi_{i}p_{ij}}{c_{i}^{\mathrm{prior}}+N\pi_{i}}=p_{ij}, \:\:\:\:(14)$$

i.e., the estimator is “asymptotically unbiased”.
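This asymptotic behavior can be checked numerically by estimating $\hat{\mathbf{P}}$ via Eq. (11) from simulated trajectories of increasing length; the two-state transition matrix below is an arbitrary example, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
P_true = np.array([[0.9, 0.1],
                   [0.2, 0.8]])

def simulate(P, T, s0=0):
    """Sample a trajectory of length T from the Markov chain P."""
    traj = [s0]
    for _ in range(T - 1):
        traj.append(rng.choice(len(P), p=P[traj[-1]]))
    return traj

errors = []
for T in (100, 10_000):
    traj = simulate(P_true, T)
    # count matrix, Eqs. (1)-(2)
    C = np.zeros((2, 2))
    for s, t in zip(traj[:-1], traj[1:]):
        C[s, t] += 1
    # maximum likelihood estimate, Eq. (11)
    P_hat = C / C.sum(axis=1, keepdims=True)
    errors.append(np.max(np.abs(P_hat - P_true)))
# the estimation error shrinks as the trajectory grows,
# consistent with Eq. (14)
```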


Part of this article has been published in [2].

Citing maximum likelihood transition matrix estimation:

This is textbook knowledge. Cite any suitable book on Markov chains, such as [1].


[1]: Norris, J. R.: Markov Chains. Cambridge University Press. Cambridge Series in Statistical and Probabilistic Mathematics (1998).

[2]: J.-H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte and F. Noé: Markov models of molecular kinetics: Generation and Validation. J. Chem. Phys. 134, 174105 (2011).