joeynmt package

Submodules

joeynmt.attention module

Attention modules

class joeynmt.attention.AttentionMechanism[source]

Bases: torch.nn.modules.module.Module

Base attention class

forward(*inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class joeynmt.attention.BahdanauAttention(hidden_size=1, key_size=1, query_size=1)[source]

Bases: joeynmt.attention.AttentionMechanism

Implements Bahdanau (MLP) attention

Section A.1.2 in https://arxiv.org/pdf/1409.0473.pdf.

compute_proj_keys(keys: torch.Tensor)[source]

Compute the projection of the keys. Is efficient if pre-computed before receiving individual queries.

Parameters:keys
Returns:
compute_proj_query(query: torch.Tensor)[source]

Compute the projection of the query.

Parameters:query
Returns:
forward(query: torch.Tensor = None, mask: torch.Tensor = None, values: torch.Tensor = None)[source]

Bahdanau MLP attention forward pass.

Parameters:
  • query – the item (decoder state) to compare with the keys/memory, shape (batch_size, 1, decoder.hidden_size)
  • mask – mask out keys position (0 in invalid positions, 1 else), shape (batch_size, 1, src_length)
  • values – values (encoder states), shape (batch_size, src_length, encoder.hidden_size)
Returns:

context vector of shape (batch_size, 1, value_size), attention probabilities of shape (batch_size, 1, src_length)
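
An illustrative sketch of the additive (MLP) scoring behind this forward pass; this is not the class implementation, just the mechanism with the shapes documented above:

    import torch
    import torch.nn as nn

    batch_size, src_length, hidden = 2, 7, 16
    query = torch.randn(batch_size, 1, hidden)            # decoder state
    keys = torch.randn(batch_size, src_length, hidden)    # encoder states (here key_size == hidden)
    values = keys                                         # values are the encoder states
    mask = torch.ones(batch_size, 1, src_length)          # 1 = valid position, 0 = padding

    # score(q, k) = v^T tanh(W_q q + W_k k)
    key_layer = nn.Linear(hidden, hidden, bias=False)
    query_layer = nn.Linear(hidden, hidden, bias=False)
    energy_layer = nn.Linear(hidden, 1, bias=False)

    proj_keys = key_layer(keys)                                  # (batch, src_length, hidden)
    proj_query = query_layer(query)                              # (batch, 1, hidden)
    scores = energy_layer(torch.tanh(proj_query + proj_keys))    # (batch, src_length, 1)
    scores = scores.squeeze(-1).masked_fill(mask.squeeze(1) == 0, float("-inf"))
    alphas = torch.softmax(scores, dim=-1).unsqueeze(1)          # (batch, 1, src_length)
    context = alphas @ values                                    # (batch, 1, hidden)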

class joeynmt.attention.LuongAttention(hidden_size: int = 1, key_size: int = 1)[source]

Bases: joeynmt.attention.AttentionMechanism

Implements Luong (bilinear / multiplicative) attention.

Eq. 8 (“general”) in http://aclweb.org/anthology/D15-1166.

compute_proj_keys(keys: torch.Tensor)[source]

Compute the projection of the keys and assign them to self.proj_keys. This pre-computation is efficiently done for all keys before receiving individual queries.

Parameters:keys – shape (batch_size, src_length, encoder.hidden_size)
forward(query: torch.Tensor = None, mask: torch.Tensor = None, values: torch.Tensor = None)[source]

Luong (multiplicative / bilinear) attention forward pass. Computes context vectors and attention scores for a given query and all masked values and returns them.

Parameters:
  • query – the item (decoder state) to compare with the keys/memory, shape (batch_size, 1, decoder.hidden_size)
  • mask – mask out keys position (0 in invalid positions, 1 else), shape (batch_size, 1, src_length)
  • values – values (encoder states), shape (batch_size, src_length, encoder.hidden_size)
Returns:

context vector of shape (batch_size, 1, value_size), attention probabilities of shape (batch_size, 1, src_length)
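
For comparison, an illustrative sketch of the bilinear ("general") scoring; again only the mechanism, not the class code:

    import torch
    import torch.nn as nn

    batch_size, src_length, key_size, hidden = 2, 7, 32, 16
    query = torch.randn(batch_size, 1, hidden)              # decoder state
    keys = torch.randn(batch_size, src_length, key_size)    # encoder states
    values = keys
    mask = torch.ones(batch_size, 1, src_length)

    # "general" score (Eq. 8): score(q, k) = q W k^T, realized by projecting the keys once
    key_layer = nn.Linear(key_size, hidden, bias=False)
    proj_keys = key_layer(keys)                              # (batch, src_length, hidden)

    scores = query @ proj_keys.transpose(1, 2)               # (batch, 1, src_length)
    scores = scores.masked_fill(mask == 0, float("-inf"))
    alphas = torch.softmax(scores, dim=-1)                   # (batch, 1, src_length)
    context = alphas @ values                                # (batch, 1, key_size)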

joeynmt.batch module

Implementation of a mini-batch.

class joeynmt.batch.Batch(torch_batch, pad_index, use_cuda=False)[source]

Bases: object

Object for holding a batch of data with mask during training. Input is a batch from a torchtext iterator.

sort_by_src_length()[source]

Sort by src length (descending) and return index to revert sort

Returns:

joeynmt.builders module

Collection of builder functions

class joeynmt.builders.NoamScheduler(hidden_size: int, optimizer: torch.optim.optimizer.Optimizer, factor: float = 1, warmup: int = 4000)[source]

Bases: object

The Noam learning rate scheduler used in “Attention is all you need”. See Eq. 3 in https://arxiv.org/pdf/1706.03762.pdf

load_state_dict(state_dict)[source]

Given a state_dict, this function loads scheduler’s state

state_dict()[source]

Returns dictionary of values necessary to reconstruct scheduler

step()[source]

Update parameters and rate
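
The schedule of Eq. 3 can be written out as follows; a minimal sketch of the rate computation, assumed to correspond to this scheduler up to the factor argument:

    def noam_rate(step: int, hidden_size: int, factor: float = 1.0, warmup: int = 4000) -> float:
        # rate = factor * hidden_size^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))
        step = max(step, 1)
        return factor * hidden_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

    # The rate grows linearly during warmup and decays as 1/sqrt(step) afterwards:
    # [noam_rate(s, hidden_size=512) for s in (100, 2000, 4000, 16000)]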

class joeynmt.builders.WarmupExponentialDecayScheduler(optimizer: torch.optim.optimizer.Optimizer, peak_rate: float = 0.001, decay_length: int = 10000, warmup: int = 4000, decay_rate: float = 0.5, min_rate: float = 1e-05)[source]

Bases: object

A learning rate scheduler similar to Noam, but modified: the warmup period is kept, while the decay rate is made tunable. The decay is exponential, down to a given minimum rate.

load_state_dict(state_dict)[source]

Given a state_dict, this function loads scheduler’s state

state_dict()[source]

Returns dictionary of values necessary to reconstruct scheduler

step()[source]

Update parameters and rate
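
One possible realization of this behaviour (linear warmup to peak_rate, then exponential decay floored at min_rate); parameter names mirror the constructor above, but the exact formula used by the class may differ:

    def warmup_exp_decay_rate(step: int, peak_rate: float = 1e-3, decay_length: int = 10000,
                              warmup: int = 4000, decay_rate: float = 0.5,
                              min_rate: float = 1e-5) -> float:
        if step < warmup:
            # linear warmup towards the peak rate
            return peak_rate * step / max(warmup, 1)
        # exponential decay after warmup, never below min_rate
        exceed = step - warmup
        return max(peak_rate * decay_rate ** (exceed / decay_length), min_rate)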

joeynmt.builders.build_gradient_clipper(config: dict) → Optional[Callable][source]

Define the function for gradient clipping as specified in configuration. If not specified, returns None.

Current options:
  • “clip_grad_val”: clip the gradients if they exceed this value,
    see torch.nn.utils.clip_grad_value_
  • “clip_grad_norm”: clip the gradients if their norm exceeds this value,
    see torch.nn.utils.clip_grad_norm_
Parameters:config – dictionary with training configurations
Returns:clipping function (in-place) or None if no gradient clipping
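
A sketch of how such a clipping callable could be assembled from the two torch utilities named above (illustrative, using only the documented config keys):

    from functools import partial
    from typing import Callable, Optional

    import torch

    def build_gradient_clipper_sketch(config: dict) -> Optional[Callable]:
        if "clip_grad_val" in config:
            return partial(torch.nn.utils.clip_grad_value_, clip_value=config["clip_grad_val"])
        if "clip_grad_norm" in config:
            return partial(torch.nn.utils.clip_grad_norm_, max_norm=config["clip_grad_norm"])
        return None

    # clip_fn = build_gradient_clipper_sketch({"clip_grad_norm": 1.0})
    # if clip_fn is not None:
    #     clip_fn(model.parameters())   # call after loss.backward(), before optimizer.step()
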
joeynmt.builders.build_optimizer(config: dict, parameters: Generator) → torch.optim.optimizer.Optimizer[source]

Create an optimizer for the given parameters as specified in config.

Except for the weight decay and initial learning rate, default optimizer settings are used.

Currently supported configuration settings for “optimizer”:
  • “sgd” (default): see torch.optim.SGD
  • “adam”: see torch.optim.Adam
  • “adagrad”: see torch.optim.Adagrad
  • “adadelta”: see torch.optim.Adadelta
  • “rmsprop”: see torch.optim.RMSprop

The initial learning rate is set according to “learning_rate” in the config. The weight decay is set according to “weight_decay” in the config. If they are not specified, the initial learning rate is set to 3.0e-4, the weight decay to 0.

Note that the scheduler state is saved in the checkpoint, so if you load a model for further training you have to use the same type of scheduler.

Parameters:
  • config – configuration dictionary
  • parameters
Returns:

optimizer
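
A hedged usage sketch based on the config keys documented above; the stand-in module only supplies the parameters argument:

    import torch.nn as nn
    from joeynmt.builders import build_optimizer

    train_cfg = {"optimizer": "adam", "learning_rate": 3.0e-4, "weight_decay": 0.0}

    model = nn.Linear(8, 8)   # stand-in for a joeynmt Model
    optimizer = build_optimizer(config=train_cfg, parameters=model.parameters())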

joeynmt.builders.build_scheduler(config: dict, optimizer: torch.optim.optimizer.Optimizer, scheduler_mode: str, hidden_size: int = 0) -> (typing.Union[torch.optim.lr_scheduler._LRScheduler, NoneType], typing.Union[str, NoneType])[source]

Create a learning rate scheduler if specified in config and determine when a scheduler step should be executed.

Current options:
  • “plateau”: see torch.optim.lr_scheduler.ReduceLROnPlateau
  • “decaying”: see torch.optim.lr_scheduler.StepLR
  • “exponential”: see torch.optim.lr_scheduler.ExponentialLR
  • “noam”: see joeynmt.builders.NoamScheduler
  • “warmupexponentialdecay”: see joeynmt.builders.WarmupExponentialDecayScheduler

If no scheduler is specified, returns (None, None) which will result in a constant learning rate.

Parameters:
  • config – training configuration
  • optimizer – optimizer for the scheduler, determines the set of parameters which the scheduler sets the learning rate for
  • scheduler_mode – “min” or “max”, depending on whether the validation score should be minimized or maximized. Only relevant for “plateau”.
  • hidden_size – encoder hidden size (required for NoamScheduler)
Returns:

  • scheduler: scheduler object,
  • scheduler_step_at: either “validation” or “epoch”

joeynmt.constants module

Defining global constants

joeynmt.constants.DEFAULT_UNK_ID()

joeynmt.data module

Data module

class joeynmt.data.MonoDataset(path: str, ext: str, field: torchtext.data.field.Field, **kwargs)[source]

Bases: torchtext.data.dataset.Dataset

Defines a dataset for machine translation without targets.

static sort_key(ex)[source]
joeynmt.data.load_data(data_cfg: dict, datasets: list = None) -> (torchtext.data.dataset.Dataset, torchtext.data.dataset.Dataset, typing.Union[torchtext.data.dataset.Dataset, NoneType], <class 'joeynmt.vocabulary.Vocabulary'>, <class 'joeynmt.vocabulary.Vocabulary'>)[source]

Load train, dev and optionally test data as specified in configuration. Vocabularies are created from the training set with a limit of voc_limit tokens and a minimum token frequency of voc_min_freq (specified in the configuration dictionary).

The training data is filtered to include sentences up to max_sent_length on source and target side.

If you set random_train_subset, a random selection of this size is used from the training set instead of the full training set.

Parameters:
  • data_cfg – configuration dictionary for data (“data” part of the configuration file)
  • datasets – list of dataset names to load
Returns:

  • train_data: training dataset
  • dev_data: development dataset
  • test_data: test dataset if given, otherwise None
  • src_vocab: source vocabulary extracted from training data
  • trg_vocab: target vocabulary extracted from training data

joeynmt.data.make_data_iter(dataset: torchtext.data.dataset.Dataset, batch_size: int, batch_type: str = 'sentence', train: bool = False, shuffle: bool = False) → torchtext.data.iterator.Iterator[source]

Returns a torchtext iterator for a torchtext dataset.

Parameters:
  • dataset – torchtext dataset containing src and optionally trg
  • batch_size – size of the batches the iterator prepares
  • batch_type – measure batch size by sentence count or by token count
  • train – whether it’s training time; when turned off, bucketing, sorting within batches, and shuffling are disabled
  • shuffle – whether to shuffle the data before each epoch (no effect if set to True for testing)
Returns:

torchtext iterator
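
A usage sketch combining load_data and make_data_iter (assuming a configuration file with a valid “data” section; the pad_index passed to Batch is illustrative):

    from joeynmt.batch import Batch
    from joeynmt.data import load_data, make_data_iter
    from joeynmt.helpers import load_config

    cfg = load_config("configs/default.yaml")
    train_data, dev_data, test_data, src_vocab, trg_vocab = load_data(data_cfg=cfg["data"])

    train_iter = make_data_iter(train_data, batch_size=80, batch_type="sentence",
                                train=True, shuffle=True)
    for torch_batch in iter(train_iter):
        batch = Batch(torch_batch, pad_index=1)   # pad_index depends on the vocabulary
        break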

joeynmt.data.token_batch_size_fn(new, count, sofar)[source]

Compute batch size based on number of tokens (+padding).

joeynmt.decoders module

Various decoders

class joeynmt.decoders.Decoder[source]

Bases: torch.nn.modules.module.Module

Base decoder class

output_size

Return the output size (size of the target vocabulary)

Returns:
class joeynmt.decoders.RecurrentDecoder(rnn_type: str = 'gru', emb_size: int = 0, hidden_size: int = 0, encoder: joeynmt.encoders.Encoder = None, attention: str = 'bahdanau', num_layers: int = 1, vocab_size: int = 0, dropout: float = 0.0, emb_dropout: float = 0.0, hidden_dropout: float = 0.0, init_hidden: str = 'bridge', input_feeding: bool = True, freeze: bool = False, **kwargs)[source]

Bases: joeynmt.decoders.Decoder

A conditional RNN decoder with attention.

forward(trg_embed: torch.Tensor, encoder_output: torch.Tensor, encoder_hidden: torch.Tensor, src_mask: torch.Tensor, unroll_steps: int, hidden: torch.Tensor = None, prev_att_vector: torch.Tensor = None, **kwargs) -> (<class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>)[source]

Unroll the decoder one step at a time for unroll_steps steps. For every step, the _forward_step function is called internally.

During training, the target inputs (trg_embed) are already known for the full sequence, so the full unroll is done. In this case, hidden and prev_att_vector are None.

For inference, this function is called with one step at a time since embedded targets are the predictions from the previous time step. In this case, hidden and prev_att_vector are fed from the output of the previous call of this function (from the 2nd step on).

src_mask is needed to mask out the areas of the encoder states that should not receive any attention, which is everything after the first <eos>.

The encoder_output are the hidden states from the encoder and are used as context for the attention.

The encoder_hidden is the last encoder hidden state that is used to initialize the first hidden decoder state (when self.init_hidden_option is “bridge” or “last”).

Parameters:
  • trg_embed – embedded target inputs, shape (batch_size, trg_length, embed_size)
  • encoder_output – hidden states from the encoder, shape (batch_size, src_length, encoder.output_size)
  • encoder_hidden – last state from the encoder, shape (batch_size, encoder.output_size)
  • src_mask – mask for src states: 0s for padded areas, 1s for the rest, shape (batch_size, 1, src_length)
  • unroll_steps – number of steps to unroll the decoder RNN
  • hidden – previous decoder hidden state, if not given it’s initialized as in self.init_hidden, shape (batch_size, num_layers, hidden_size)
  • prev_att_vector – previous attentional vector, if not given it’s initialized with zeros, shape (batch_size, 1, hidden_size)
Returns:

  • outputs: shape (batch_size, unroll_steps, vocab_size),
  • hidden: last hidden state (batch_size, num_layers, hidden_size),
  • att_probs: attention probabilities
    with shape (batch_size, unroll_steps, src_length),
  • att_vectors: attentional vectors
    with shape (batch_size, unroll_steps, hidden_size)

class joeynmt.decoders.TransformerDecoder(num_layers: int = 4, num_heads: int = 8, hidden_size: int = 512, ff_size: int = 2048, dropout: float = 0.1, emb_dropout: float = 0.1, vocab_size: int = 1, freeze: bool = False, **kwargs)[source]

Bases: joeynmt.decoders.Decoder

A transformer decoder with N masked layers. Decoder layers are masked so that an attention head cannot see the future.

forward(trg_embed: torch.Tensor = None, encoder_output: torch.Tensor = None, encoder_hidden: torch.Tensor = None, src_mask: torch.Tensor = None, unroll_steps: int = None, hidden: torch.Tensor = None, trg_mask: torch.Tensor = None, **kwargs)[source]

Transformer decoder forward pass.

Parameters:
  • trg_embed – embedded targets
  • encoder_output – source representations
  • encoder_hidden – unused
  • src_mask
  • unroll_steps – unused
  • hidden – unused
  • trg_mask – to mask out target paddings. Note that a subsequent mask is applied here.
  • kwargs
Returns:

joeynmt.embeddings module

Embedding module

class joeynmt.embeddings.Embeddings(embedding_dim: int = 64, scale: bool = False, vocab_size: int = 0, padding_idx: int = 1, freeze: bool = False, **kwargs)[source]

Bases: torch.nn.modules.module.Module

Simple embeddings class

forward(x: torch.Tensor) → torch.Tensor[source]

Perform lookup for input x in the embedding table.

Parameters:x – index in the vocabulary
Returns:embedded representation for x
load_from_file(embed_path: str, vocab: joeynmt.vocabulary.Vocabulary)[source]

Load pretrained embedding weights from text file.

  • First line is expected to contain vocabulary size and dimension. The dimension has to match the model’s specified embedding size, the vocabulary size is used in logging only.
  • Each line should contain word and embedding weights separated by spaces.
  • Pretrained vocabulary items that are not part of joeynmt’s vocabulary will be ignored (not loaded from the file).
  • The initialization (specified in config[“model”][“embed_initializer”]) of joeynmt’s vocabulary items that are not part of the pretrained vocabulary will be kept (not overwritten in this function).
  • This function should be called after initialization!
Example:
    2 5
    the -0.0230 -0.0264 0.0287 0.0171 0.1403
    at -0.0395 -0.1286 0.0275 0.0254 -0.0932
Parameters:
  • embed_path – embedding weights text file
  • vocab – Vocabulary object

joeynmt.encoders module

class joeynmt.encoders.Encoder[source]

Bases: torch.nn.modules.module.Module

Base encoder class

output_size

Return the output size

Returns:
class joeynmt.encoders.RecurrentEncoder(rnn_type: str = 'gru', hidden_size: int = 1, emb_size: int = 1, num_layers: int = 1, dropout: float = 0.0, emb_dropout: float = 0.0, bidirectional: bool = True, freeze: bool = False, **kwargs)[source]

Bases: joeynmt.encoders.Encoder

Encodes a sequence of word embeddings

forward(embed_src: torch.Tensor, src_length: torch.Tensor, mask: torch.Tensor, **kwargs) -> (<class 'torch.Tensor'>, <class 'torch.Tensor'>)[source]

Applies a bidirectional RNN to a sequence of embeddings x. The input mini-batch x needs to be sorted by src length. x and mask should have the same dimensions [batch, time, dim].

Parameters:
  • embed_src – embedded src inputs, shape (batch_size, src_len, embed_size)
  • src_length – length of src inputs (counting tokens before padding), shape (batch_size)
  • mask – indicates padding areas (zeros where padding), shape (batch_size, src_len, embed_size)
Returns:

  • output: hidden states with
    shape (batch_size, max_length, directions*hidden),
  • hidden_concat: last hidden state with
    shape (batch_size, directions*hidden)

class joeynmt.encoders.TransformerEncoder(hidden_size: int = 512, ff_size: int = 2048, num_layers: int = 8, num_heads: int = 4, dropout: float = 0.1, emb_dropout: float = 0.1, freeze: bool = False, **kwargs)[source]

Bases: joeynmt.encoders.Encoder

Transformer Encoder

forward(embed_src: torch.Tensor, src_length: torch.Tensor, mask: torch.Tensor, **kwargs) -> (<class 'torch.Tensor'>, <class 'torch.Tensor'>)[source]

Pass the input (and mask) through each layer in turn. Applies a Transformer encoder to a sequence of embeddings x. The input mini-batch x needs to be sorted by src length. x and mask should have the same dimensions [batch, time, dim].

Parameters:
  • embed_src – embedded src inputs, shape (batch_size, src_len, embed_size)
  • src_length – length of src inputs (counting tokens before padding), shape (batch_size)
  • mask – indicates padding areas (zeros where padding), shape (batch_size, 1, src_len)
Returns:

  • output: hidden states with
    shape (batch_size, max_length, directions*hidden),
  • hidden_concat: last hidden state with
    shape (batch_size, directions*hidden)

joeynmt.helpers module

Collection of helper functions

exception joeynmt.helpers.ConfigurationError[source]

Bases: Exception

Custom exception for misspecifications of configuration

joeynmt.helpers.bpe_postprocess(string, bpe_type='subword-nmt') → str[source]

Post-processor for BPE output. Recombines BPE-split tokens.

Parameters:
  • string
  • bpe_type – one of {“sentencepiece”, “subword-nmt”}
Returns:

post-processed string
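
An illustrative sketch of the two recombination rules (subword-nmt uses an "@@ " continuation marker, sentencepiece a "▁" word-boundary marker); not necessarily the exact library code:

    def bpe_postprocess_sketch(string: str, bpe_type: str = "subword-nmt") -> str:
        if bpe_type == "subword-nmt":
            return string.replace("@@ ", "")
        if bpe_type == "sentencepiece":
            return string.replace(" ", "").replace("▁", " ").strip()
        return string

    # bpe_postprocess_sketch("th@@ is is an ex@@ ample")  ->  "this is an example"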

joeynmt.helpers.clones(module: torch.nn.modules.module.Module, n: int) → torch.nn.modules.container.ModuleList[source]

Produce N identical layers. Transformer helper function.

Parameters:
  • module – the module to clone
  • n – clone this many times

Returns:cloned modules
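
Such a helper is typically a few lines of deep copies; a sketch:

    import copy
    import torch.nn as nn

    def clones_sketch(module: nn.Module, n: int) -> nn.ModuleList:
        # deep copies, so the cloned layers do not share parameters
        return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])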

joeynmt.helpers.freeze_params(module: torch.nn.modules.module.Module) → None[source]

Freeze the parameters of this module, i.e. do not update them during training

Parameters:module – freeze parameters of this module
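
A minimal sketch of what freezing amounts to:

    import torch.nn as nn

    def freeze_params_sketch(module: nn.Module) -> None:
        for param in module.parameters():
            param.requires_grad = False   # excluded from gradient updates
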
joeynmt.helpers.get_latest_checkpoint(ckpt_dir: str) → Optional[str][source]

Returns the latest checkpoint (by time) from the given directory. If there is no checkpoint in this directory, returns None

Parameters:ckpt_dir
Returns:latest checkpoint file
joeynmt.helpers.latest_checkpoint_update(target: pathlib.Path, link_name: str) → Optional[pathlib.Path][source]

This function finds the file that the symlink currently points to, sets it to the new target, and returns the previous target if it exists.

Parameters:
  • target – A path to a file that we want the symlink to point to.
  • link_name – This is the name of the symlink that we want to update.
Returns:

  • current_last: This is the previous target of the symlink, before it is
    updated in this function. If the symlink did not exist before or did not have a target, None is returned instead.

joeynmt.helpers.load_checkpoint(path: str, use_cuda: bool = True) → dict[source]

Load model from saved checkpoint.

Parameters:
  • path – path to checkpoint
  • use_cuda – using cuda or not
Returns:

checkpoint (dict)

joeynmt.helpers.load_config(path='configs/default.yaml') → dict[source]

Loads and parses a YAML configuration file.

Parameters:path – path to YAML configuration file
Returns:configuration dictionary
joeynmt.helpers.log_cfg(cfg: dict, prefix: str = 'cfg') → None[source]

Write configuration to log.

Parameters:
  • cfg – configuration to log
  • prefix – prefix for logging
joeynmt.helpers.log_data_info(train_data: torchtext.data.dataset.Dataset, valid_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset, src_vocab: joeynmt.vocabulary.Vocabulary, trg_vocab: joeynmt.vocabulary.Vocabulary) → None[source]

Log statistics of data and vocabulary.

Parameters:
  • train_data
  • valid_data
  • test_data
  • src_vocab
  • trg_vocab
joeynmt.helpers.make_logger(log_dir: str = None, mode: str = 'train') → str[source]

Create a logger for logging the training/testing process.

Parameters:
  • log_dir – path to file where log is stored as well
  • mode – log file name. ‘train’, ‘test’ or ‘translate’
Returns:

joeynmt version number

joeynmt.helpers.make_model_dir(model_dir: str, overwrite=False) → str[source]

Create a new directory for the model.

Parameters:
  • model_dir – path to model directory
  • overwrite – whether to overwrite an existing directory
Returns:

path to model directory

joeynmt.helpers.set_seed(seed: int) → None[source]

Set the random seed for modules torch, numpy and random.

Parameters:seed – random seed
joeynmt.helpers.store_attention_plots(attentions: numpy.array, targets: List[List[str]], sources: List[List[str]], output_prefix: str, indices: List[int], tb_writer: Optional[torch.utils.tensorboard.writer.SummaryWriter] = None, steps: int = 0) → None[source]

Saves attention plots.

Parameters:
  • attentions – attention scores
  • targets – list of tokenized targets
  • sources – list of tokenized sources
  • output_prefix – prefix for attention plots
  • indices – indices selected for plotting
  • tb_writer – Tensorboard summary writer (optional)
  • steps – current training steps, needed for tb_writer
  • dpi – resolution for images
joeynmt.helpers.subsequent_mask(size: int) → torch.Tensor[source]

Mask out subsequent positions (to prevent attending to future positions). Transformer helper function.

Parameters:size – size of mask (2nd and 3rd dim)
Returns:Tensor with 0s and 1s of shape (1, size, size)
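
A sketch producing the documented shape (1, size, size), where position i may attend to positions <= i:

    import torch

    def subsequent_mask_sketch(size: int) -> torch.Tensor:
        return torch.tril(torch.ones(1, size, size, dtype=torch.uint8))

    # subsequent_mask_sketch(3):
    # [[[1, 0, 0],
    #   [1, 1, 0],
    #   [1, 1, 1]]]
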
joeynmt.helpers.tile(x: torch.Tensor, count: int, dim=0) → torch.Tensor[source]

Tiles x on dimension dim count times. From OpenNMT. Used for beam search.

Parameters:
  • x – tensor to tile
  • count – number of tiles
  • dim – dimension along which the tensor is tiled
Returns:

tiled tensor
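
The effect (each slice along dim repeated count times in a row, e.g. once per beam) can be illustrated with repeat_interleave; this is equivalent in outcome but not the OpenNMT implementation:

    import torch

    def tile_sketch(x: torch.Tensor, count: int, dim: int = 0) -> torch.Tensor:
        return torch.repeat_interleave(x, count, dim=dim)

    # tile_sketch(torch.tensor([[1, 2], [3, 4]]), count=3, dim=0)
    # -> [[1, 2], [1, 2], [1, 2], [3, 4], [3, 4], [3, 4]]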

joeynmt.initialization module

Implements custom initialization

joeynmt.initialization.initialize_model(model: torch.nn.modules.module.Module, cfg: dict, src_padding_idx: int, trg_padding_idx: int) → None[source]

This initializes a model based on the provided config.

All initializer configuration is part of the model section of the configuration file. For an example, see e.g. https://github.com/joeynmt/joeynmt/blob/master/configs/iwslt_envi_xnmt.yaml#L47

The main initializer is set using the initializer key. Possible values are xavier, uniform, normal or zeros. (xavier is the default).

When an initializer is set to uniform, then init_weight sets the range for the values (-init_weight, init_weight).

When an initializer is set to normal, then init_weight sets the standard deviation for the weights (with mean 0).

The word embedding initializer is set using embed_initializer and takes the same values. The default is normal with embed_init_weight = 0.01.

Biases are initialized separately using bias_initializer. The default is zeros, but you can use the same initializers as the main initializer.

Set init_rnn_orthogonal to True if you want RNN orthogonal initialization (for recurrent matrices). Default is False.

lstm_forget_gate controls how the LSTM forget gate is initialized. Default is 1.

Parameters:
  • model – model to initialize
  • cfg – the model configuration
  • src_padding_idx – index of source padding token
  • trg_padding_idx – index of target padding token
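
A model section using the keys described above could look as follows (shown as a Python dict; all values are illustrative, and encoder/decoder/embedding sub-sections are omitted):

    model_cfg = {
        "initializer": "xavier",         # main initializer: xavier / uniform / normal / zeros
        "init_weight": 0.01,             # range for "uniform", std-dev for "normal"
        "embed_initializer": "normal",   # word embedding initializer
        "embed_init_weight": 0.01,
        "bias_initializer": "zeros",
        "init_rnn_orthogonal": False,    # orthogonal init of recurrent matrices
        "lstm_forget_gate": 1.0,         # value used by lstm_forget_gate_init_
    }
    # initialize_model(model, model_cfg, src_padding_idx=1, trg_padding_idx=1)
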
joeynmt.initialization.lstm_forget_gate_init_(cell: torch.nn.modules.rnn.RNNBase, value: float = 1.0) → None[source]

Initialize LSTM forget gates with value.

Parameters:
  • cell – LSTM cell
  • value – initial value, default: 1
joeynmt.initialization.orthogonal_rnn_init_(cell: torch.nn.modules.rnn.RNNBase, gain: float = 1.0)[source]

Orthogonal initialization of recurrent weights. RNN parameters contain 3 or 4 matrices in one parameter, so we slice it.

joeynmt.initialization.xavier_uniform_n_(w: torch.Tensor, gain: float = 1.0, n: int = 4) → None[source]

Xavier initializer for parameters that combine multiple matrices in one parameter for efficiency. This is used e.g. for GRU and LSTM parameters, where all gates are computed at the same time by one big matrix.

Parameters:
  • w – parameter
  • gain – default 1
  • n – default 4
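
A sketch of the per-gate initialization this function describes (the real implementation may differ in details):

    import torch
    import torch.nn as nn

    def xavier_uniform_n_sketch(w: torch.Tensor, gain: float = 1.0, n: int = 4) -> None:
        # w stacks n gate matrices along dim 0 (e.g. 4 for an LSTM);
        # initialize each slice separately so fan-in/fan-out refer to a single gate
        with torch.no_grad():
            for gate_w in w.chunk(n, dim=0):
                nn.init.xavier_uniform_(gate_w, gain=gain)

    # w = torch.empty(4 * 32, 16)    # like an LSTM weight_ih_l0 with hidden_size=32
    # xavier_uniform_n_sketch(w)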

joeynmt.metrics module

This module holds various MT evaluation metrics.

joeynmt.metrics.bleu(hypotheses, references, tokenize='13a')[source]

Raw corpus BLEU from sacrebleu (without tokenization)

Parameters:
  • hypotheses – list of hypotheses (strings)
  • references – list of references (strings)
  • tokenize – one of {‘none’, ‘13a’, ‘intl’, ‘zh’, ‘ja-mecab’}
Returns:

joeynmt.metrics.chrf(hypotheses, references, remove_whitespace=True)[source]

Character F-score from sacrebleu

Parameters:
  • hypotheses – list of hypotheses (strings)
  • references – list of references (strings)
  • remove_whitespace – (bool)
Returns:

joeynmt.metrics.sequence_accuracy(hypotheses, references)[source]

Compute the accuracy of hypothesis tokens: correct tokens / all tokens. Tokens are correct if they appear in the same position in the reference.

Parameters:
  • hypotheses – list of hypotheses (strings)
  • references – list of references (strings)
Returns:

joeynmt.metrics.token_accuracy(hypotheses: List[List[str]], references: List[List[str]]) → float[source]

Compute the accuracy of hypothesis tokens: correct tokens / all tokens. Tokens are correct if they appear in the same position in the reference.

Parameters:
  • hypotheses – list of tokenized hypotheses (List[List[str]])
  • references – list of tokenized references (List[List[str]])
Returns:

token accuracy (float)
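
A sketch of the computation (whether the library returns a fraction or a percentage is not specified here):

    from typing import List

    def token_accuracy_sketch(hypotheses: List[List[str]], references: List[List[str]]) -> float:
        correct, total = 0, 0
        for hyp, ref in zip(hypotheses, references):
            total += len(hyp)
            correct += sum(h == r for h, r in zip(hyp, ref))
        return correct / total if total > 0 else 0.0

    # token_accuracy_sketch([["a", "b", "c"]], [["a", "x", "c"]])  ->  0.66...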

joeynmt.model module

Module to represent whole models

class joeynmt.model.Model(encoder: joeynmt.encoders.Encoder, decoder: joeynmt.decoders.Decoder, src_embed: joeynmt.embeddings.Embeddings, trg_embed: joeynmt.embeddings.Embeddings, src_vocab: joeynmt.vocabulary.Vocabulary, trg_vocab: joeynmt.vocabulary.Vocabulary)[source]

Bases: torch.nn.modules.module.Module

Base Model class

forward(return_type: str = None, **kwargs) -> (<class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>)[source]

Interface for multi-gpu

For DataParallel, all model calls (model.encode(), model.decode(), and model.encode_decode()) need to be wrapped by model.__call__(). model.__call__() triggers model.forward() together with its pre hooks and post hooks, which take care of multi-GPU distribution.

Parameters:return_type – one of {“loss”, “encode”, “decode”}
loss_function
joeynmt.model.build_model(cfg: dict = None, src_vocab: joeynmt.vocabulary.Vocabulary = None, trg_vocab: joeynmt.vocabulary.Vocabulary = None) → joeynmt.model.Model[source]

Build and initialize the model according to the configuration.

Parameters:
  • cfg – dictionary configuration containing model specifications
  • src_vocab – source vocabulary
  • trg_vocab – target vocabulary
Returns:

built and initialized model
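
A hedged usage sketch, assuming the default config path of load_config and a configuration with “data” and “model” sections:

    from joeynmt.data import load_data
    from joeynmt.helpers import load_config
    from joeynmt.model import build_model

    cfg = load_config("configs/default.yaml")
    _, _, _, src_vocab, trg_vocab = load_data(data_cfg=cfg["data"])
    model = build_model(cfg["model"], src_vocab=src_vocab, trg_vocab=trg_vocab)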

joeynmt.plotting module

joeynmt.plotting.plot_heatmap(scores: numpy.array, column_labels: List[str], row_labels: List[str], output_path: Optional[str] = None, dpi: int = 300) → matplotlib.figure.Figure[source]

Plotting function that can be used to visualize (self-)attention. Plots are saved if output_path is specified, in the format given by that file’s extension (‘pdf’ or ‘png’).

Parameters:
  • scores – attention scores
  • column_labels – labels for columns (e.g. target tokens)
  • row_labels – labels for rows (e.g. source tokens)
  • output_path – path to save to
  • dpi – set resolution for matplotlib
Returns:

pyplot figure

joeynmt.prediction module

This modules holds methods for generating predictions from a model.

joeynmt.prediction.parse_test_args(cfg, mode='test')[source]

Parse test arguments.

Parameters:
  • cfg – config object
  • mode – ‘test’ or ‘translate’
Returns:

joeynmt.prediction.test(cfg_file, ckpt: str, batch_class: joeynmt.batch.Batch = <class 'joeynmt.batch.Batch'>, output_path: str = None, save_attention: bool = False, datasets: dict = None) → None[source]

Main test function. Handles loading a model from checkpoint, generating translations and storing them and attention plots.

Parameters:
  • cfg_file – path to configuration file
  • ckpt – path to checkpoint to load
  • batch_class – class type of batch
  • output_path – path to output
  • datasets – datasets to predict
  • save_attention – whether to save the computed attention weights
joeynmt.prediction.translate(cfg_file: str, ckpt: str, output_path: str = None, batch_class: joeynmt.batch.Batch = <class 'joeynmt.batch.Batch'>, n_best: int = 1) → None[source]

Interactive translation function. Loads a model from checkpoint and translates either input read from stdin or, in interactive mode, input entered at a prompt. The input has to be pre-processed according to the data the model was trained on, i.e. tokenized or split into subwords. Translations are printed to stdout.

Parameters:
  • cfg_file – path to configuration file
  • ckpt – path to checkpoint to load
  • output_path – path to output file
  • batch_class – class type of batch
  • n_best – amount of candidates to display
joeynmt.prediction.validate_on_data(model: joeynmt.model.Model, data: torchtext.data.dataset.Dataset, batch_size: int, use_cuda: bool, max_output_length: int, level: str, eval_metric: Optional[str], n_gpu: int, batch_class: joeynmt.batch.Batch = <class 'joeynmt.batch.Batch'>, compute_loss: bool = False, beam_size: int = 1, beam_alpha: int = -1, batch_type: str = 'sentence', postprocess: bool = True, bpe_type: str = 'subword-nmt', sacrebleu: dict = None, n_best: int = 1) -> (<class 'float'>, <class 'float'>, <class 'float'>, typing.List[str], typing.List[typing.List[str]], typing.List[str], typing.List[str], typing.List[typing.List[str]], typing.List[<built-in function array>])[source]

Generate translations for the given data. If compute_loss is True and references are given, also compute the loss.

Parameters:
  • model – model module
  • data – dataset for validation
  • batch_size – validation batch size
  • batch_class – class type of batch
  • use_cuda – if True, use CUDA
  • max_output_length – maximum length for generated hypotheses
  • level – segmentation level, one of “char”, “bpe”, “word”
  • eval_metric – evaluation metric, e.g. “bleu”
  • n_gpu – number of GPUs
  • compute_loss – whether to compute a scalar loss for given inputs and targets
  • beam_size – beam size for validation. If <2 then greedy decoding (default).
  • beam_alpha – beam search alpha for length penalty, disabled if set to -1 (default).
  • batch_type – validation batch type (sentence or token)
  • postprocess – if True, remove BPE segmentation from translations
  • bpe_type – bpe type, one of {“subword-nmt”, “sentencepiece”}
  • sacrebleu – sacrebleu options
  • n_best – number of candidates to return
Returns:

  • current_valid_score: current validation score [eval_metric],
  • valid_loss: validation loss,
  • valid_ppl: validation perplexity,
  • valid_sources: validation sources,
  • valid_sources_raw: raw validation sources (before post-processing),
  • valid_references: validation references,
  • valid_hypotheses: validation_hypotheses,
  • decoded_valid: raw validation hypotheses (before post-processing),
  • valid_attention_scores: attention scores for validation hypotheses

joeynmt.search module

joeynmt.search.greedy(src_mask: torch.Tensor, max_output_length: int, model: joeynmt.model.Model, encoder_output: torch.Tensor, encoder_hidden: torch.Tensor) -> (<built-in function array>, <built-in function array>)[source]

Greedy decoding. Select the token with the highest probability at each time step. This function is a wrapper that calls recurrent_greedy for recurrent decoders and transformer_greedy for transformer decoders.

Parameters:
  • src_mask – mask for source inputs, 0 for positions after </s>
  • max_output_length – maximum length for the hypotheses
  • model – model to use for greedy decoding
  • encoder_output – encoder hidden states for attention
  • encoder_hidden – encoder last state for decoder initialization
Returns:

joeynmt.search.transformer_greedy(src_mask: torch.Tensor, max_output_length: int, model: joeynmt.model.Model, encoder_output: torch.Tensor, encoder_hidden: torch.Tensor) -> (<built-in function array>, None)[source]

Special greedy function for transformer, since it works differently. The transformer remembers all previous states and attends to them.

Parameters:
  • src_mask – mask for source inputs, 0 for positions after </s>
  • max_output_length – maximum length for the hypotheses
  • model – model to use for greedy decoding
  • encoder_output – encoder hidden states for attention
  • encoder_hidden – encoder final state (unused in Transformer)
Returns:

  • stacked_output: output hypotheses (2d array of indices),
  • stacked_attention_scores: attention scores (3d array)

joeynmt.search.beam_search(model, size, encoder_output, encoder_hidden, src_mask, max_output_length, alpha, n_best=1)[source]

Beam search with size k. Inspired by OpenNMT-py, adapted for Transformer. In each decoding step, find the k most likely partial hypotheses.

Parameters:
  • model
  • size – size of the beam
  • encoder_output
  • encoder_hidden
  • src_mask
  • max_output_length
  • alpha – alpha factor for length penalty
  • n_best – return this many hypotheses, <= beam (currently only 1)
Returns:

  • stacked_output: output hypotheses (2d array of indices),
  • stacked_attention_scores: attention scores (3d array)
joeynmt.search.run_batch(model: joeynmt.model.Model, batch: joeynmt.batch.Batch, max_output_length: int, beam_size: int, beam_alpha: float, n_best: int = 1) -> (<built-in function array>, <built-in function array>)[source]

Get outputs and attention scores for a given batch

Parameters:
  • model – Model class
  • batch – batch to generate hypotheses for
  • max_output_length – maximum length of hypotheses
  • beam_size – size of the beam for beam search, if 0 use greedy
  • beam_alpha – alpha value for beam search
  • n_best – candidates to return
Returns:

stacked_output: hypotheses for batch, stacked_attention_scores: attention scores for batch

joeynmt.training module

Training module

class joeynmt.training.TrainManager(model: joeynmt.model.Model, config: dict, batch_class: joeynmt.batch.Batch = <class 'joeynmt.batch.Batch'>)[source]

Bases: object

Manages training loop, validations, learning rate scheduling and early stopping.

class TrainStatistics(steps: int = 0, stop: bool = False, total_tokens: int = 0, best_ckpt_iter: int = 0, best_ckpt_score: float = inf, minimize_metric: bool = True)[source]

Bases: object

is_best(score)[source]
init_from_checkpoint(path: str, reset_best_ckpt: bool = False, reset_scheduler: bool = False, reset_optimizer: bool = False, reset_iter_state: bool = False) → None[source]

Initialize the trainer from a given checkpoint file.

This checkpoint file contains not only model parameters, but also scheduler and optimizer states, see self._save_checkpoint.

Parameters:
  • path – path to checkpoint
  • reset_best_ckpt – reset tracking of the best checkpoint, use for domain adaptation with a new dev set or when using a new metric for fine-tuning.
  • reset_scheduler – reset the learning rate scheduler, and do not use the one stored in the checkpoint.
  • reset_optimizer – reset the optimizer, and do not use the one stored in the checkpoint.
  • reset_iter_state – reset the sampler’s internal state and do not use the one stored in the checkpoint.
train_and_validate(train_data: torchtext.data.dataset.Dataset, valid_data: torchtext.data.dataset.Dataset) → None[source]

Train the model and validate it from time to time on the validation set.

Parameters:
  • train_data – training data
  • valid_data – validation data
joeynmt.training.train(cfg_file: str) → None[source]

Main training function. After training, also test on test data if given.

Parameters:cfg_file – path to configuration yaml file

joeynmt.vocabulary module

Vocabulary module

class joeynmt.vocabulary.Vocabulary(tokens: List[str] = None, file: str = None)[source]

Bases: object

Vocabulary represents a mapping between tokens and indices.

add_tokens(tokens: List[str]) → None[source]

Add list of tokens to vocabulary

Parameters:tokens – list of tokens to add to the vocabulary
array_to_sentence(array: numpy.array, cut_at_eos=True, skip_pad=True) → List[str][source]

Converts an array of IDs to a sentence, optionally cutting the result off at the end-of-sequence token.

Parameters:
  • array – 1D array containing indices
  • cut_at_eos – cut the decoded sentences at the first <eos>
  • skip_pad – skip generated <pad> tokens
Returns:

list of strings (tokens)

arrays_to_sentences(arrays: numpy.array, cut_at_eos=True, skip_pad=True) → List[List[str]][source]

Convert multiple arrays containing sequences of token IDs to their sentences, optionally cutting them off at the end-of-sequence token.

Parameters:
  • arrays – 2D array containing indices
  • cut_at_eos – cut the decoded sentences at the first <eos>
  • skip_pad – skip generated <pad> tokens
Returns:

list of list of strings (tokens)

is_unk(token: str) → bool[source]

Check whether a token is covered by the vocabulary

Parameters:token
Returns:True if covered, False otherwise
to_file(file: str) → None[source]

Save the vocabulary to a file, by writing token with index i in line i.

Parameters:file – path to file where the vocabulary is written
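
A small usage sketch of the methods above (assuming the constructor adds the special symbols such as <unk> itself):

    from joeynmt.vocabulary import Vocabulary

    vocab = Vocabulary(tokens=["the", "cat", "sat"])
    vocab.add_tokens(["mat"])
    assert not vocab.is_unk("cat")
    assert vocab.is_unk("dog")
    vocab.to_file("vocab.txt")   # one token per line, line i holds the token with index i
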
joeynmt.vocabulary.build_vocab(field: str, max_size: int, min_freq: int, dataset: torchtext.data.dataset.Dataset, vocab_file: str = None) → joeynmt.vocabulary.Vocabulary[source]

Builds a vocabulary for a torchtext field from the given dataset or vocab_file.

Parameters:
  • field – attribute e.g. “src”
  • max_size – maximum size of vocabulary
  • min_freq – minimum frequency for an item to be included
  • dataset – dataset to load data for field from
  • vocab_file – if given, load the vocabulary from this file instead of building it from the dataset
Returns:

Vocabulary created from either dataset or vocab_file

joeynmt.loss module

Module to implement training loss

class joeynmt.loss.XentLoss(pad_index: int, smoothing: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

Cross-Entropy Loss with optional label smoothing

forward(log_probs, targets)[source]

Compute the cross-entropy between the predicted log probabilities and the targets.

If label smoothing is used, target distributions are not one-hot, but “1-smoothing” for the correct target token and the rest of the probability mass is uniformly spread across the other tokens.

Parameters:
  • log_probs – log probabilities as predicted by model
  • targets – target indices
Returns:
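
An illustrative construction of such a smoothed target distribution (one common formulation, dividing the remaining mass over the non-gold, non-padding tokens; not the class code):

    import torch

    def smoothed_targets(targets: torch.Tensor, vocab_size: int,
                         pad_index: int, smoothing: float = 0.1) -> torch.Tensor:
        # 1 - smoothing on the gold token, the rest spread uniformly; padding rows zeroed
        dist = torch.full((targets.size(0), vocab_size), smoothing / (vocab_size - 2))
        dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
        dist[:, pad_index] = 0.0
        dist[targets == pad_index] = 0.0
        return dist

    # targets = torch.tensor([4, 2]); dist = smoothed_targets(targets, vocab_size=8, pad_index=1)
    # the loss is then e.g. torch.nn.functional.kl_div(log_probs, dist, reduction="sum")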

joeynmt.transformer_layers module

class joeynmt.transformer_layers.MultiHeadedAttention(num_heads: int, size: int, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

Multi-Head Attention module from “Attention is All You Need”

Implementation modified from OpenNMT-py. https://github.com/OpenNMT/OpenNMT-py

forward(k: torch.Tensor, v: torch.Tensor, q: torch.Tensor, mask: torch.Tensor = None)[source]

Computes multi-headed attention.

Parameters:
  • k – keys [B, M, D] with M being the sentence length.
  • v – values [B, M, D]
  • q – query [B, M, D]
  • mask – optional mask [B, 1, M]
Returns:

class joeynmt.transformer_layers.PositionalEncoding(size: int = 0, max_len: int = 5000)[source]

Bases: torch.nn.modules.module.Module

Pre-compute position encodings (PE). In forward pass, this adds the position-encodings to the input for as many time steps as necessary.

Implementation based on OpenNMT-py. https://github.com/OpenNMT/OpenNMT-py

forward(emb)[source]

Add the pre-computed position encodings to the input embeddings.

Parameters:emb – sequence of word vectors (FloatTensor), shape (seq_len, batch_size, self.dim)
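
An illustrative pre-computation of the sinusoidal table (assumes an even size; the class code may differ in detail):

    import math
    import torch

    def positional_encoding_table(size: int, max_len: int = 5000) -> torch.Tensor:
        # pe[pos, 2i] = sin(pos / 10000^(2i/size)),  pe[pos, 2i+1] = cos(pos / 10000^(2i/size))
        pe = torch.zeros(max_len, size)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, size, 2, dtype=torch.float)
                             * -(math.log(10000.0) / size))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe   # (max_len, size); a slice of length seq_len is added to the embeddings
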
class joeynmt.transformer_layers.PositionwiseFeedForward(input_size, ff_size, dropout=0.1)[source]

Bases: torch.nn.modules.module.Module

Position-wise feed-forward layer. Projects to ff_size and then back down to input_size.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class joeynmt.transformer_layers.TransformerDecoderLayer(size: int = 0, ff_size: int = 0, num_heads: int = 0, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

Transformer decoder layer.

Consists of self-attention, source-attention, and feed-forward.

forward(x: torch.Tensor = None, memory: torch.Tensor = None, src_mask: torch.Tensor = None, trg_mask: torch.Tensor = None) → torch.Tensor[source]

Forward pass of a single Transformer decoder layer.

Parameters:
  • x – inputs
  • memory – source representations
  • src_mask – source mask
  • trg_mask – target mask (so as to not condition on future steps)
Returns:

output tensor

class joeynmt.transformer_layers.TransformerEncoderLayer(size: int = 0, ff_size: int = 0, num_heads: int = 0, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

One Transformer encoder layer has a Multi-head attention layer plus a position-wise feed-forward layer.

forward(x: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]

Forward pass for a single transformer encoder layer. First applies layer norm, then self attention, then dropout with residual connection (adding the input to the result), and then a position-wise feed-forward layer.

Parameters:
  • x – layer input
  • mask – input mask
Returns:

output tensor
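
A sketch of the operation order described above, built from generic PyTorch modules (requires a PyTorch version with batch_first in nn.MultiheadAttention; not the joeynmt implementation, and the padding-mask convention here is the inverse of joeynmt's 1 = valid):

    import torch
    import torch.nn as nn

    class PreNormEncoderLayerSketch(nn.Module):
        def __init__(self, size: int = 512, num_heads: int = 8,
                     ff_size: int = 2048, dropout: float = 0.1):
            super().__init__()
            self.layer_norm = nn.LayerNorm(size)
            self.self_attn = nn.MultiheadAttention(size, num_heads,
                                                   dropout=dropout, batch_first=True)
            self.dropout = nn.Dropout(dropout)
            self.feed_forward = nn.Sequential(
                nn.LayerNorm(size),
                nn.Linear(size, ff_size), nn.ReLU(), nn.Dropout(dropout),
                nn.Linear(ff_size, size), nn.Dropout(dropout),
            )

        def forward(self, x: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
            # layer norm -> self-attention -> dropout + residual
            h = self.layer_norm(x)
            attn_out, _ = self.self_attn(h, h, h, key_padding_mask=pad_mask)  # True = padded
            x = x + self.dropout(attn_out)
            # position-wise feed-forward with its own residual
            return x + self.feed_forward(x)

    # x = torch.randn(2, 7, 512); pad_mask = torch.zeros(2, 7, dtype=torch.bool)
    # out = PreNormEncoderLayerSketch()(x, pad_mask)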

Module contents