Derivation of the Baum-Welch algorithm for HMMs

Consider our formulation of HMMs as before, i.e. {v_k} is the alphabet, O = O_1 O_2 ... O_T is the output sequence, s_t is the state at time t, A is the matrix of transition probabilities a_ij = P(s_t = j | s_{t-1} = i), and B is the matrix of state-symbol probabilities b_jk = P(O_t = v_k | s_t = j) of observing the symbol v_k in state j.

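Let Z = f(A,B) be a random variable whose value x_ijk denotes the joint event consisting of the transition from state i to state j and the observation of the symbol v_k from state j. The observed output is then completely specified by a sequence X = x_{i_1j_1k_1} x_{i_2j_2k_2} ... x_{i_Tj_Tk_T}. Let p_ijk = P(x_ijk), the probability of this event given that the model is in state i, so that p_ijk = a_ij b_jk.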

Then the parameter of interest, θ, is exactly the distribution of Z, i.e. the set of all p_ijk, since it represents the totality of transition and state-symbol probabilities.
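As a small illustration of this parameterization, the sketch below builds the table of p_ijk from a transition matrix and a state-symbol matrix. The two-state, three-symbol model and all of its numbers are made up for the example, and the names `A`, `B`, `p`, `row_sums` and `A_back` are hypothetical.

```python
# Hypothetical toy HMM: 2 states, 3 symbols (all numbers are made up).
A = [[0.7, 0.3],
     [0.4, 0.6]]            # A[i][j] = a_ij
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]       # B[j][k] = b_jk

n_states, n_symbols = len(A), len(B[0])

# p[i][j][k] = a_ij * b_jk: probability of the event x_ijk given state i.
p = [[[A[i][j] * B[j][k] for k in range(n_symbols)]
      for j in range(n_states)]
     for i in range(n_states)]

# For each source state i, the p_ijk form a distribution over the pairs (j, k).
row_sums = [sum(p[i][j][k] for j in range(n_states) for k in range(n_symbols))
            for i in range(n_states)]

# The transition probabilities are recovered by summing out the symbol index.
A_back = [[sum(p[i][j]) for j in range(n_states)] for i in range(n_states)]
```

Each entry of `row_sums` should be 1 up to rounding, which is the normalization that the p_ijk must satisfy for every state i.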

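We wish to find a θ' such that

\[ \sum_X P(X|O) \log P'(O,X) > \sum_X P(X|O) \log P(O,X), \]

where P and P' denote probabilities computed under θ and θ' respectively, and the sums run over all sequences X.

The expression on the left is called Baum's auxiliary function. We have already seen that if the above inequality holds, then P'(O) > P(O). So choosing θ' to maximize the LHS is a way of increasing the likelihood of the output from P(O) to P'(O).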

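To do this, we first find a stationary point of the LHS subject to the constraints Σ_{j,k} p_ijk' = 1, one for each state i. Using Lagrange multipliers λ_i, the function to maximize with respect to the variables p_ijk' is thus

\[ F = \sum_X P(X|O) \log P'(O,X) - \sum_i \lambda_i \Big( \sum_{j,k} p'_{ijk} - 1 \Big). \]

Then

\[ \frac{\partial F}{\partial p'_{ijk}} = \sum_X P(X|O) \, \frac{\partial}{\partial p'_{ijk}} \log P'(O,X) - \lambda_i. \qquad (16) \]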

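Since X completely specifies O (recall that x_ijk also specifies that symbol v_k is observed in state j),

\[ P'(O,X) = P'(X). \]

If c_ijk denotes the number of times x_ijk occurs in X, then, taking the initial state as given,

\[ P'(X) = \prod_{i,j,k} (p'_{ijk})^{c_{ijk}}. \]

Hence,

\[ \frac{\partial}{\partial p'_{ijk}} \log P'(O,X) = \frac{c_{ijk}}{p'_{ijk}}. \qquad (17) \]

We can also satisfy ourselves that

\[ \frac{\partial^2}{\partial (p'_{ijk})^2} \log P'(O,X) = -\frac{c_{ijk}}{(p'_{ijk})^2} \le 0, \]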

and so the stationary point obtained by setting the first derivative to zero is indeed a maximum.
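The claim that the stationary point is a maximum can be spot-checked numerically for a single state i. The sketch below assumes a hypothetical vector of nonnegative counts for the (j, k) pairs leaving that state and compares the normalized counts against random points on the probability simplex. The function names and the numbers are illustrative only.

```python
import math
import random

def weighted_log_lik(counts, probs):
    """Return sum_m counts[m] * log(probs[m]), the quantity being maximized."""
    return sum(c * math.log(q) for c, q in zip(counts, probs) if c > 0)

def normalize(values):
    """Scale nonnegative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]

# Hypothetical (expected) counts for the (j, k) pairs leaving one state i.
counts = [3.0, 1.5, 0.5, 5.0]
p_star = normalize(counts)          # stationary point: counts / sum(counts)
best = weighted_log_lik(counts, p_star)

# Compare against 1000 random distributions over the same four outcomes.
rng = random.Random(0)
trials = [normalize([rng.random() + 1e-9 for _ in counts]) for _ in range(1000)]
worst_gap = min(best - weighted_log_lik(counts, q) for q in trials)
```

If the argument holds, `worst_gap` is nonnegative: no sampled distribution scores higher than the normalized counts. This is a sanity check on sampled points, not a proof.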

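Substituting (17) into (16) and setting the result to zero,

\[ \sum_X P(X|O) \, \frac{c_{ijk}(X)}{p'_{ijk}} - \lambda_i = 0, \]

where λ_i is the Lagrange multiplier for the constraint on state i and c_ijk(X) makes explicit that the count depends on the sequence X. If we now write K_i = 1/λ_i, then

\[ p'_{ijk} = K_i \sum_X P(X|O) \, c_{ijk}(X), \]

where K_i can be interpreted as a normalizing constant. The sum over X can now be rewritten as a sum over the times t = 1, ..., T by grouping the sequences such that in each group x_{i_tj_tk_t} = x_ijk:

\[ p'_{ijk} = K_i \sum_{t=1}^{T} P(x_{i_t j_t k_t} = x_{ijk} \mid O). \]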

The RHS above is just the reestimated probability of x_ijk because prior to normalization, the sum yields the expected frequency of the x_ijk computed using the current values of the p_ijk.
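To make the reestimate concrete, here is a brute-force sketch that enumerates every sequence X consistent with a short output, accumulates the expected count of each x_ijk weighted by P(X|O), and normalizes per source state i. It assumes a fixed, known start state (which the text does not specify) and a made-up two-state, two-symbol model. All function and variable names are hypothetical, and the enumeration is exponential in the output length, so it is suitable only for tiny examples.

```python
from itertools import product

def path_prob(p, start, path, obs):
    """P(X) for the event sequence X implied by a state path and the output."""
    prob, i = 1.0, start
    for j, k in zip(path, obs):
        prob *= p[i][j][k]      # event x_ijk: move i -> j, emit v_k from j
        i = j
    return prob

def likelihood(p, start, obs):
    """P(O): the sum of P(X) over every X consistent with the output O."""
    n = len(p)
    return sum(path_prob(p, start, path, obs)
               for path in product(range(n), repeat=len(obs)))

def reestimate(p, start, obs):
    """One brute-force update: expected counts of x_ijk, normalized per state i."""
    n, m = len(p), len(p[0][0])
    total = likelihood(p, start, obs)
    counts = [[[0.0] * m for _ in range(n)] for _ in range(n)]
    for path in product(range(n), repeat=len(obs)):
        posterior = path_prob(p, start, path, obs) / total   # P(X|O)
        i = start
        for j, k in zip(path, obs):
            counts[i][j][k] += posterior    # adds P(X|O) once per occurrence
            i = j
    new_p = []
    for i in range(n):
        norm = sum(sum(row) for row in counts[i])    # plays the role of 1 / K_i
        if norm == 0.0:
            new_p.append([row[:] for row in p[i]])   # state i never used: keep old
        else:
            new_p.append([[c / norm for c in row] for row in counts[i]])
    return new_p

# Hypothetical toy model: 2 states, 2 symbols, p[i][j][k] = a_ij * b_jk.
A = [[0.6, 0.4], [0.5, 0.5]]
B = [[0.7, 0.3], [0.2, 0.8]]
p = [[[A[i][j] * B[j][k] for k in range(2)] for j in range(2)] for i in range(2)]
obs = [0, 1, 1, 0, 1]           # indices of the observed symbols

before = likelihood(p, 0, obs)
p_new = reestimate(p, 0, obs)
after = likelihood(p_new, 0, obs)
```

Comparing `before` and `after` shows whether the single update raised the likelihood of the output, as the derivation predicts.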

We can now proceed to calculate each p_ijk' as above. The set of all such p_ijk' constitutes our θ', which we can use in place of θ during the next iteration.

The EM theorem tells us that if we continue in this fashion, then after each iteration

1. either the reestimated θ' is more likely than the original θ in the sense that P'(O) > P(O), or
2. we have reached a stationary point of the likelihood function at which P'(O) = P(O).

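This is an EM algorithm: it is iterative, and each iteration consists of an expectation step (aka estimation step), which computes the expected frequency of each x_ijk under the current parameters, and a maximization step (aka modification step), which normalizes those frequencies to obtain the new parameters.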
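The iteration can be sketched with a forward-backward pass in place of explicit enumeration, so that the expected counts are accumulated time step by time step. As before, this is a minimal sketch under assumptions not found in the text: a fixed start state, a made-up toy model, and hypothetical names (`forward_backward`, `em_step`, `history`).

```python
def forward_backward(p, start, obs):
    """Return forward and backward tables and P(O) for event probabilities p."""
    n, T = len(p), len(obs)
    alpha = [[0.0] * n for _ in range(T + 1)]
    beta = [[1.0] * n for _ in range(T + 1)]
    alpha[0][start] = 1.0
    for t in range(1, T + 1):
        k = obs[t - 1]
        for j in range(n):
            alpha[t][j] = sum(alpha[t - 1][i] * p[i][j][k] for i in range(n))
    for t in range(T, 0, -1):
        k = obs[t - 1]
        for i in range(n):
            beta[t - 1][i] = sum(p[i][j][k] * beta[t][j] for j in range(n))
    return alpha, beta, sum(alpha[T])

def em_step(p, start, obs):
    """E-step: expected counts of x_ijk. M-step: normalize per source state i."""
    n, m = len(p), len(p[0][0])
    alpha, beta, total = forward_backward(p, start, obs)
    counts = [[[0.0] * m for _ in range(n)] for _ in range(n)]
    for t in range(1, len(obs) + 1):
        k = obs[t - 1]
        for i in range(n):
            for j in range(n):
                # Posterior probability that the event at time t is x_ijk.
                counts[i][j][k] += alpha[t - 1][i] * p[i][j][k] * beta[t][j] / total
    new_p = []
    for i in range(n):
        norm = sum(sum(row) for row in counts[i])
        if norm > 0:
            new_p.append([[c / norm for c in row] for row in counts[i]])
        else:
            new_p.append([row[:] for row in p[i]])   # unused state: keep old row
    return new_p, total

# Hypothetical toy model: 2 states, 2 symbols, p[i][j][k] = a_ij * b_jk.
A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.8, 0.2], [0.4, 0.6]]
p = [[[A[i][j] * B[j][k] for k in range(2)] for j in range(2)] for i in range(2)]
obs = [0, 1, 1, 0, 1, 0, 0, 1]

history = []                     # P(O) under the parameters entering each step
for _ in range(20):
    p, lik = em_step(p, 0, obs)
    history.append(lik)
```

The values in `history` should never decrease from one iteration to the next, which is the behaviour the EM theorem describes.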

Note that the algorithm is not guaranteed to converge to the global maximum of the likelihood; it may settle at a local maximum or another stationary point. But it has been found to yield good results in practice.

Anand Venkataraman
1999-09-16
