Neural Machine Translation using Encoder-Decoder
Important aspects of the Sequence-to-Sequence model
- Vectorize words (or characters, if you are working on character sequences) to encode them into a numerical representation; see the first sketch after this list. The training data usually consists of pairs of source and target sequences $<S, T>$, ordered by time step and grouped into batches.
- Feeding the input sequences in reversed order helps training: the beginning of the sentence is then fed to the encoder last, and that is exactly what the decoder needs first when it starts translating. Note that only the source sequences (for example, English sentences) are reversed; the target sequences (for example, French) are fed in their natural order.
- Rather than using a simple one-hot encoding for the words, the model gets a great uplift if we add an embedding layer that projects words into a low-dimensional space that retains semantic properties of the language. For instance, the words “King” and “Queen” lie close together in the embedding space.
- The input embeddings are then fed into the encoder, which is typically a network from the RNN family (for example, an LSTM or GRU).
- The final state output by the encoder is fed as input to the decoder network. In addition to the encoder’s output state, the target sequences are fed to the decoder (teacher forcing). The decoder outputs a score for each word in the target vocabulary, which a softmax layer converts into a probability. Since probabilities are generated over the entire target-language vocabulary, the word with the highest probability is taken as the prediction.
- You might wonder how the decoder works at inference time, since the targets will not be available then. That is a good question to ask. At inference time, the decoder is fed the previously predicted word at each time step, and “<SOS>” at the first step. So at $t_0$ we feed “<SOS>”, at $t_1$ we feed the output of $t_0$ as input, and so on (a minimal Keras model and greedy decoding loop are sketched after this list).
- So far so good. But the input and output sequences may not have the same length, and lengths vary across samples. One option is to simply crop extra words after a certain length, but that leads to incomplete translations and is not advisable. The better alternative for handling variable sequence lengths is to group sentences into buckets, with each bucket holding sentences of similar sizes (e.g., one bucket for sentences of up to 6 words, another for sentences of 7 to 15 words), and then pad within each bucket (fill the gaps with a special token such as <pad>) so that all sequences in a bucket have the same length (see the padding sketch after this list).
- It is not efficient to output probabilities for all the words in the vocabulary; instead, employ a sampling approach. A sampled softmax loss can be used during training (see the sampled-softmax sketch after this list).
- `TrainingSampler` and `ScheduledEmbeddingTrainingSampler` are available in TensorFlow Addons; a minimal instantiation appears below.
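
For the vectorization, reversal, and embedding bullets above, here is a minimal sketch; the toy sentences, the vocabulary size, and the 64-dimensional embedding are made-up assumptions:

```python
import tensorflow as tf

# Toy parallel corpus (made-up example sentences).
source_texts = ["how are you", "i am fine"]
target_texts = ["<sos> comment allez vous <eos>", "<sos> je vais bien <eos>"]

# Vectorize: map each word to an integer id.
src_tok = tf.keras.preprocessing.text.Tokenizer(filters="")
src_tok.fit_on_texts(source_texts)
src_seqs = src_tok.texts_to_sequences(source_texts)

tgt_tok = tf.keras.preprocessing.text.Tokenizer(filters="")
tgt_tok.fit_on_texts(target_texts)
tgt_seqs = tgt_tok.texts_to_sequences(target_texts)

# Reverse only the source sequences; targets keep their natural order.
src_seqs = [list(reversed(s)) for s in src_seqs]

# The embedding layer projects word ids into a dense, low-dimensional space.
src_vocab = len(src_tok.word_index) + 1          # +1 because id 0 is reserved for padding
embed_dim = 64                                   # assumed embedding size
embedding = tf.keras.layers.Embedding(input_dim=src_vocab, output_dim=embed_dim, mask_zero=True)

padded = tf.keras.preprocessing.sequence.pad_sequences(src_seqs, padding="post")
print(embedding(tf.constant(padded)).shape)      # (batch, num_steps, embed_dim)
```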
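The encoder-decoder wiring, teacher forcing, and the "<SOS>"-driven greedy decoding loop could look roughly like the sketch below. The vocabulary sizes, the LSTM width, and the <sos>/<eos> ids are assumptions, and the weights are left untrained:

```python
import numpy as np
import tensorflow as tf

# Assumed sizes and special-token ids, for illustration only.
src_vocab, tgt_vocab = 5000, 6000
embed_dim, hidden = 128, 256
sos_id, eos_id = 1, 2

# ----- Training model: teacher forcing (the true target sequence is fed to the decoder) -----
enc_inputs = tf.keras.Input(shape=(None,), name="source_ids")
enc_emb = tf.keras.layers.Embedding(src_vocab, embed_dim, mask_zero=True)(enc_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(hidden, return_state=True)(enc_emb)
enc_states = [state_h, state_c]                  # final encoder state, handed to the decoder

dec_inputs = tf.keras.Input(shape=(None,), name="target_ids")
dec_emb_layer = tf.keras.layers.Embedding(tgt_vocab, embed_dim, mask_zero=True)
dec_lstm = tf.keras.layers.LSTM(hidden, return_sequences=True, return_state=True)
dec_dense = tf.keras.layers.Dense(tgt_vocab, activation="softmax")   # scores -> probabilities

dec_outputs, _, _ = dec_lstm(dec_emb_layer(dec_inputs), initial_state=enc_states)
train_model = tf.keras.Model([enc_inputs, dec_inputs], dec_dense(dec_outputs))
train_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# ----- Inference: reuse the trained layers and feed back the previous prediction -----
encoder_model = tf.keras.Model(enc_inputs, enc_states)

state_h_in = tf.keras.Input(shape=(hidden,))
state_c_in = tf.keras.Input(shape=(hidden,))
step_in = tf.keras.Input(shape=(1,))
step_out, step_h, step_c = dec_lstm(dec_emb_layer(step_in), initial_state=[state_h_in, state_c_in])
decoder_model = tf.keras.Model([step_in, state_h_in, state_c_in],
                               [dec_dense(step_out), step_h, step_c])

def greedy_translate(source_ids, max_len=20):
    """Greedy decoding: start with <sos> and feed each prediction back in."""
    h, c = encoder_model.predict(source_ids, verbose=0)
    word = np.array([[sos_id]])
    result = []
    for _ in range(max_len):
        probs, h, c = decoder_model.predict([word, h, c], verbose=0)
        next_id = int(np.argmax(probs[0, -1]))   # pick the highest-probability word
        if next_id == eos_id:
            break
        result.append(next_id)
        word = np.array([[next_id]])
    return result

print(greedy_translate(np.array([[4, 8, 15, 0, 0]])))  # untrained weights, so output is arbitrary
```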
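A small sketch of bucketing plus padding, assuming token id 0 plays the role of <pad> and using the 6-word / 7-to-15-word buckets from the bullet above:

```python
import tensorflow as tf

# Token-id sequences of different lengths (made-up example).
sequences = [[5, 12, 7], [9, 3, 3, 8, 2, 6], [4, 11, 2, 10, 9, 7, 1, 5, 13]]

# Group sequences of similar length into buckets, then pad within each bucket
# so that every sequence in a bucket has the same length (id 0 acts as <pad>).
buckets = {
    "up_to_6_words": [s for s in sequences if len(s) <= 6],
    "7_to_15_words": [s for s in sequences if 6 < len(s) <= 15],
}

for name, bucket in buckets.items():
    padded = tf.keras.preprocessing.sequence.pad_sequences(bucket, padding="post", value=0)
    print(name, padded.shape)   # every row in a bucket now has the bucket's max length
```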
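A sketch of a sampled softmax loss at training time; the batch size, hidden size, and number of sampled negatives are arbitrary assumptions:

```python
import tensorflow as tf

batch, hidden, tgt_vocab, num_sampled = 32, 256, 6000, 64   # assumed sizes

# Decoder hidden states for one time step and the true next-word ids.
dec_hidden = tf.random.normal([batch, hidden])
true_ids = tf.random.uniform([batch, 1], maxval=tgt_vocab, dtype=tf.int64)

# Output projection kept as explicit variables so the loss can sample rows from it.
proj_w = tf.Variable(tf.random.normal([tgt_vocab, hidden]))
proj_b = tf.Variable(tf.zeros([tgt_vocab]))

# During training, score only `num_sampled` negative classes instead of all 6000.
loss = tf.nn.sampled_softmax_loss(weights=proj_w, biases=proj_b,
                                  labels=true_ids, inputs=dec_hidden,
                                  num_sampled=num_sampled, num_classes=tgt_vocab)
print(loss.shape)   # one loss value per example in the batch
```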
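A minimal instantiation of the two samplers mentioned in the last bullet; the 0.25 sampling probability is an arbitrary example:

```python
import tensorflow_addons as tfa

# TrainingSampler reads the ground-truth target at every decoding step (plain teacher forcing).
sampler = tfa.seq2seq.TrainingSampler()

# ScheduledEmbeddingTrainingSampler occasionally feeds the model's own prediction back in
# instead of the ground truth (scheduled sampling); 0.25 is an arbitrary example probability.
scheduled_sampler = tfa.seq2seq.ScheduledEmbeddingTrainingSampler(sampling_probability=0.25)
```

Either sampler is then passed to a `tfa.seq2seq.BasicDecoder` together with the decoder cell and output projection.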
- What is a TimeDistributed layer in Keras?
- Dimensions in an RNN layer. Weights: $W_{xh}$ for the inputs, $W_{hh}$ for the hidden state from the previous time step, and $W_{hq}$ for the outputs:
| Tensor | Shape |
| --- | --- |
| Input | $(batch, num\_steps, feature\_dimensions)$ |
| $W_{xh}$ | $(feature\_dimensions, hidden)$ |
| Hidden | $(batch, hidden)$ |
| $W_{hh}$ | $(hidden, hidden)$ |
| $W_{hq}$ | $(hidden, output)$ |
| Output | $(batch, output)$ |
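
The shapes in the table can be checked against a Keras SimpleRNN followed by a Dense output layer; the sizes below are arbitrary:

```python
import tensorflow as tf

batch, num_steps, feature_dimensions, hidden, output = 32, 10, 300, 128, 50   # arbitrary sizes

x = tf.random.normal([batch, num_steps, feature_dimensions])   # Input
rnn = tf.keras.layers.SimpleRNN(hidden)                        # returns the last hidden state
head = tf.keras.layers.Dense(output)

h = rnn(x)        # Hidden: (batch, hidden)
y = head(h)       # Output: (batch, output)

w_xh, w_hh, _ = rnn.get_weights()     # kernel, recurrent kernel, bias
w_hq, _ = head.get_weights()          # output projection kernel, bias
print(w_xh.shape)                     # (feature_dimensions, hidden)
print(w_hh.shape)                     # (hidden, hidden)
print(w_hq.shape)                     # (hidden, output)
print(h.shape, y.shape)               # (batch, hidden) (batch, output)
```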
- How can an RNN handle variable-length sequences?
- How many parameters does an RNN need to learn?
- How can we improve training time?
- How to improve the convergence of an RNN model?