Maximum likelihood transition matrix¶
Frank Noe
frank.noe@fu-berlin.de
FU Berlin,
Arnimallee 6, 14195 Berlin, Germany
Summary:
We have observed one or several trajectories amongst $n$ discrete
states. At lag time $\tau$, there have been $c_{ij}=c_{ij}(\tau)$
transitions between these states, as encoded in a count matrix
$\mathbf{C}\in\mathbb{R}^{n\times n}$. We aim at estimating a Markov
transition matrix, $\hat{\mathbf{P}}\in\mathbb{R}^{n\times n}$.
Mathematically, $$\begin{aligned}
\hat{\mathbf{P}} & = & \arg\max_{\mathbf{P}}p(\mathbf{C}\mid\mathbf{P}).\end{aligned}$$
It turns out that the maximum likelihood estimator is identical to the
intuitive fraction of counts,
$\hat{p}_{ij}=c_{ij}/\sum_{k=1}^{n}c_{ik}$.
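As an illustration of the count matrix and the count-fraction estimator above, here is a minimal NumPy sketch (the short example trajectory and the function names are ours, purely for illustration):

```python
import numpy as np

def count_transitions(traj, n):
    """Count matrix: C[i, j] = number of observed i -> j transitions."""
    C = np.zeros((n, n), dtype=int)
    for s, s_next in zip(traj[:-1], traj[1:]):
        C[s, s_next] += 1
    return C

def mle_transition_matrix(C):
    """Row-normalize the count matrix: p_ij = c_ij / sum_k c_ik."""
    C = np.asarray(C, dtype=float)
    return C / C.sum(axis=1, keepdims=True)

traj = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1]   # hypothetical 2-state trajectory
C = count_transitions(traj, n=2)
P = mle_transition_matrix(C)             # each row of P sums to 1
```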
Consider that we have observed a trajectory $[s_{1},...,s_{T}]$ amongst
$n$ discrete states labeled $1$ through $n$. The numbers of transitions
between each pair of states $i$ and $j$ are collected in the count
matrix $\mathbf{C}=[c_{ij}^{\mathrm{obs}}]$, i.e.:
$$c_{ij}^{\mathrm{obs}}=\left|\{t\in\{1,...,T-1\}\mid s_{t}=i\wedge s_{t+1}=j\}\right|
\:\:\:\:(1)$$ and the total number of transitions out of state $i$ is
$c_{i}^{\mathrm{obs}}$:
$$c_{i}^{\mathrm{obs}}=\sum_{j=1}^{n}c_{ij}^{\mathrm{obs}}.
\:\:\:\:(2)$$ The likelihood of a Markov chain with transition matrix
$\mathbf{P}=[p_{ij}]\in\mathbb{R}^{n\times n}$ is given by:
$$p(\mathbf{C}^{\mathrm{obs}}\mid\mathbf{P})=\prod_{i,j=1}^{n}p_{ij}^{c_{ij}^{\mathrm{obs}}}
\:\:\:\:(3)$$ Our aim is to estimate the transition matrix that
maximizes Eq. (3). In order to generalize our approach
we also consider the maximization of a posterior probability which uses
a conjugate prior, i.e. a prior with the same functional form as the
likelihood:
$$p(\mathbf{P}\mid\mathbf{C}^{\mathrm{obs}})\propto\prod_{i,j=1}^{n}p_{ij}^{c_{ij}^{\mathrm{prior}}+c_{ij}^{\mathrm{obs}}}=\prod_{i,j=1}^{n}p_{ij}^{c_{ij}},
\:\:\:\:(4)$$ where $c_{ij}^{\mathrm{prior}}$ are prior counts. We now
seek the matrix that maximizes Eq. (4). For the choice
$c_{ij}^{\mathrm{prior}}\equiv0$ (uniform prior), this is identical to
the maximum likelihood estimator. The probability in Eq. (4) is
difficult to work with due to the product.
For optimization purposes it is therefore a common “trick” to instead
work with the logarithm of the likelihood (log-likelihood):
$$Q=\log p(\mathbf{P}\mid\mathbf{C})=\sum_{i,j}c_{ij}\log p_{ij}.
\:\:\:\:(5)$$ This is useful since the logarithm is a strictly
monotonic function: as a result, $\log f$ and $f$ attain their maxima
at the same point. However, this function is not bounded from above, since for
$p_{ij}\rightarrow\infty$, $Q\rightarrow\infty$. Of course, we somehow
need to restrict ourselves to sets of variables which actually form
transition matrices, i.e., they satisfy the constraint:
$$\sum_{j}p_{ij}=1.
\:\:\:\:(6)$$ When optimizing with equality constraints, one uses
Lagrange multipliers. The Lagrangian for $Q$ is given by:
$$F=Q+\lambda_{1}\left(\sum_{j}p_{1j}-1\right)+...+\lambda_{n}\left(\sum_{j}p_{nj}-1\right).
\:\:\:\:(7)$$ This function is maximized by the maximum likelihood
transition matrix. It turns out that $F$ only has a single stationary
point, which can be easily found by setting the partial derivatives,
$$\frac{\partial F}{\partial p_{ij}}=\frac{c_{ij}}{p_{ij}}+\lambda_{i},
\:\:\:\:(8)$$ to zero:
$$\begin{aligned}
\frac{c_{ij}}{\hat{p}_{ij}}+\lambda_{i} & =0\\
\lambda_{i}\hat{p}_{ij} & =-c_{ij}.
\:\:\:\:(9)\end{aligned}$$ We now sum over $j$ on both sides and make
use of the transition matrix property:
$$\lambda_{i}\sum_{j=1}^{n}\hat{p}_{ij}=\lambda_{i}=-\sum_{j=1}^{n}c_{ij}=-c_{i}.
\:\:\:\:(10)$$ Inserting the Lagrange multipliers,
$\lambda_{i}=-c_{i}$, into Eq. (9) then gives:
$$\begin{aligned}
\frac{c_{ij}}{\hat{p}_{ij}}-c_{i} & =0\\
\hat{p}_{ij} & =\frac{c_{ij}}{c_{i}}.
\:\:\:\:(11)\end{aligned}$$ It turns out that $\hat{\mathbf{P}}(\tau)$,
as provided by Eq. (11), is the maximum of
$p(\mathbf{P}|\mathbf{C}^{\mathrm{obs}})$ and thus also of
$p(\mathbf{C}^{\mathrm{obs}}|\mathbf{P})$ when transition matrices are
assumed to be uniformly distributed
a priori.
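In code, the estimator, including the conjugate prior counts of Eq. (4), amounts to a row normalization of the summed (prior plus observed) count matrix. A minimal NumPy sketch, where the example counts and pseudocount values are arbitrary choices of ours:

```python
import numpy as np

def map_transition_matrix(C_obs, C_prior):
    """Maximum-posterior estimate with conjugate prior counts (Eq. 4):
    p_ij = (c_ij^prior + c_ij^obs) / sum_k (c_ik^prior + c_ik^obs)."""
    C = np.asarray(C_prior, dtype=float) + np.asarray(C_obs, dtype=float)
    return C / C.sum(axis=1, keepdims=True)

C_obs = np.array([[5, 1], [2, 2]])       # hypothetical observed counts
uniform_prior = np.zeros((2, 2))         # c_ij^prior = 0 recovers the MLE
pseudo_prior = np.ones((2, 2))           # e.g. one pseudocount per element
P_mle = map_transition_matrix(C_obs, uniform_prior)
P_map = map_transition_matrix(C_obs, pseudo_prior)
```

With the uniform prior this reproduces Eq. (11) exactly; the pseudocounts merely pull the rows toward the uniform distribution for sparsely sampled states.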
In the limit of infinite sampling, i.e., trajectory length
$T\rightarrow\infty$, $p(\mathbf{P}|\mathbf{C}^{\mathrm{obs}})$
converges towards a Dirac delta distribution with its peak at
$\hat{\mathbf{P}}(\tau)$. Given that $\pi_{i}$ is the equilibrium
probability of state $i$ and $N$ is the total number of observed
transitions, the expected number of transitions from state $i$ to
state $j$ is $$\mathbb{E}[c_{ij}]=N\pi_{i}p_{ij}
\:\:\:\:(12)$$ and thus $$\mathbb{E}[c_{i}]=N\pi_{i}.
\:\:\:\:(13)$$ In this case the prior contribution vanishes:
$$\lim_{N\rightarrow\infty}\hat{p}_{ij}=\lim_{N\rightarrow\infty}\frac{c_{ij}}{c_{i}}=\lim_{N\rightarrow\infty}\frac{c_{ij}^{\mathrm{prior}}+N\pi_{i}p_{ij}}{c_{i}^{\mathrm{prior}}+N\pi_{i}}=p_{ij},
\:\:\:\:(14)$$
i.e., the estimator is “asymptotically unbiased”.
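This asymptotic behavior is easy to check numerically: simulate chains of increasing length and watch the estimation error shrink. A small sketch, assuming an arbitrary two-state transition matrix and RNG seed of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
P_true = np.array([[0.9, 0.1], [0.2, 0.8]])   # arbitrary two-state example

def simulate(P, T, rng):
    """Sample a trajectory of length T from the chain with matrix P."""
    traj = [0]
    for _ in range(T - 1):
        traj.append(rng.choice(len(P), p=P[traj[-1]]))
    return traj

def estimate(traj, n):
    """Count transitions and row-normalize (Eq. 11)."""
    C = np.zeros((n, n))
    for a, b in zip(traj[:-1], traj[1:]):
        C[a, b] += 1
    return C / C.sum(axis=1, keepdims=True)

# The maximal entrywise error should shrink roughly as 1/sqrt(T).
errors = {T: np.abs(estimate(simulate(P_true, T, rng), 2) - P_true).max()
          for T in (100, 10_000)}
```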
Acknowledgements:¶
Part of this article has been published in [2].
Citing maximum likelihood transition matrix estimation:¶
This is textbook knowledge. Cite any suitable book on Markov chains,
such as [1].