The first few stages of the network are used to absorb the input sequence, and the associated output vectors are simply ignored. This part of the network can be viewed as an 'encoder' in which the entire input sentence has been compressed into the state z* of the hidden variable. The remaining network stages function as the 'decoder', which generates the translated sentence as output one word at a time. Notice that each output word is fed as input to the next stage of the network, and so this approach has an autoregressive structure analogous to (12.31).引自第381页