joeynmt package

Submodules

joeynmt.attention module

Attention modules

class joeynmt.attention.AttentionMechanism[source]

Bases: torch.nn.modules.module.Module

Base attention class

forward(*inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class joeynmt.attention.BahdanauAttention(hidden_size=1, key_size=1, query_size=1)[source]

Bases: joeynmt.attention.AttentionMechanism

Implements Bahdanau (MLP) attention

Section A.1.2 in https://arxiv.org/pdf/1409.0473.pdf.

compute_proj_keys(keys: torch.Tensor)[source]

Compute the projection of the keys. Is efficient if pre-computed before receiving individual queries.

Parameters:keys
Returns:
compute_proj_query(query: torch.Tensor)[source]

Compute the projection of the query.

Parameters:query
Returns:
forward(query: torch.Tensor = None, mask: torch.Tensor = None, values: torch.Tensor = None)[source]

Bahdanau MLP attention forward pass.

Parameters:
  • query – the item (decoder state) to compare with the keys/memory, shape (batch_size, 1, decoder.hidden_size)
  • mask – mask out keys position (0 in invalid positions, 1 else), shape (batch_size, 1, src_length)
  • values – values (encoder states), shape (batch_size, src_length, encoder.hidden_size)
Returns:

context vector of shape (batch_size, 1, value_size), attention probabilities of shape (batch_size, 1, src_length)
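
An illustrative sketch of the additive (MLP) scoring behind this forward pass; this is not the class implementation, just the mechanism with the shapes documented above:

    import torch
    import torch.nn as nn

    batch_size, src_length, hidden = 2, 7, 16
    query = torch.randn(batch_size, 1, hidden)            # decoder state
    keys = torch.randn(batch_size, src_length, hidden)    # encoder states (here key_size == hidden)
    values = keys                                         # values are the encoder states
    mask = torch.ones(batch_size, 1, src_length)          # 1 = valid position, 0 = padding

    # score(q, k) = v^T tanh(W_q q + W_k k)
    key_layer = nn.Linear(hidden, hidden, bias=False)
    query_layer = nn.Linear(hidden, hidden, bias=False)
    energy_layer = nn.Linear(hidden, 1, bias=False)

    proj_keys = key_layer(keys)                                  # (batch, src_length, hidden)
    proj_query = query_layer(query)                              # (batch, 1, hidden)
    scores = energy_layer(torch.tanh(proj_query + proj_keys))    # (batch, src_length, 1)
    scores = scores.squeeze(-1).masked_fill(mask.squeeze(1) == 0, float("-inf"))
    alphas = torch.softmax(scores, dim=-1).unsqueeze(1)          # (batch, 1, src_length)
    context = alphas @ values                                    # (batch, 1, hidden)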

class joeynmt.attention.LuongAttention(hidden_size: int = 1, key_size: int = 1)[source]

Bases: joeynmt.attention.AttentionMechanism

Implements Luong (bilinear / multiplicative) attention.

Eq. 8 (“general”) in http://aclweb.org/anthology/D15-1166.

compute_proj_keys(keys: torch.Tensor)[source]

Compute the projection of the keys and assign them to self.proj_keys. This pre-computation is efficiently done for all keys before receiving individual queries.

Parameters:keys – shape (batch_size, src_length, encoder.hidden_size)
forward(query: torch.Tensor = None, mask: torch.Tensor = None, values: torch.Tensor = None)[source]

Luong (multiplicative / bilinear) attention forward pass. Computes context vectors and attention scores for a given query and all masked values and returns them.

Parameters:
  • query – the item (decoder state) to compare with the keys/memory, shape (batch_size, 1, decoder.hidden_size)
  • mask – mask out keys position (0 in invalid positions, 1 else), shape (batch_size, 1, src_length)
  • values – values (encoder states), shape (batch_size, src_length, encoder.hidden_size)
Returns:

context vector of shape (batch_size, 1, value_size), attention probabilities of shape (batch_size, 1, src_length)
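
For comparison, an illustrative sketch of the bilinear ("general") scoring; again only the mechanism, not the class code:

    import torch
    import torch.nn as nn

    batch_size, src_length, key_size, hidden = 2, 7, 32, 16
    query = torch.randn(batch_size, 1, hidden)              # decoder state
    keys = torch.randn(batch_size, src_length, key_size)    # encoder states
    values = keys
    mask = torch.ones(batch_size, 1, src_length)

    # "general" score (Eq. 8): score(q, k) = q W k^T, realized by projecting the keys once
    key_layer = nn.Linear(key_size, hidden, bias=False)
    proj_keys = key_layer(keys)                              # (batch, src_length, hidden)

    scores = query @ proj_keys.transpose(1, 2)               # (batch, 1, src_length)
    scores = scores.masked_fill(mask == 0, float("-inf"))
    alphas = torch.softmax(scores, dim=-1)                   # (batch, 1, src_length)
    context = alphas @ values                                # (batch, 1, key_size)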

joeynmt.batch module

Implementation of a mini-batch.

class joeynmt.batch.Batch(torch_batch, pad_index, use_cuda=False)[source]

Bases: object

Object for holding a batch of data with mask during training. Input is a batch from a torchtext iterator.

sort_by_src_length()[source]

Sort by src length (descending) and return index to revert sort

Returns:

joeynmt.builders module

Collection of builder functions

class joeynmt.builders.NoamScheduler(hidden_size: int, optimizer: torch.optim.optimizer.Optimizer, factor: float = 1, warmup: int = 4000)[source]

Bases: object

The Noam learning rate scheduler used in “Attention is all you need”. See Eq. 3 in https://arxiv.org/pdf/1706.03762.pdf

load_state_dict(state_dict)[source]

Given a state_dict, this function loads scheduler’s state

state_dict()[source]

Returns dictionary of values necessary to reconstruct scheduler

step()[source]

Update parameters and rate
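
The schedule of Eq. 3 can be written out as follows; a minimal sketch of the rate computation, assumed to correspond to this scheduler up to the factor argument:

    def noam_rate(step: int, hidden_size: int, factor: float = 1.0, warmup: int = 4000) -> float:
        # rate = factor * hidden_size^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))
        step = max(step, 1)
        return factor * hidden_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

    # The rate grows linearly during warmup and decays as 1/sqrt(step) afterwards:
    # [noam_rate(s, hidden_size=512) for s in (100, 2000, 4000, 16000)]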

class joeynmt.builders.WarmupExponentialDecayScheduler(optimizer: torch.optim.optimizer.Optimizer, peak_rate: float = 0.001, decay_length: int = 10000, warmup: int = 4000, decay_rate: float = 0.5, min_rate: float = 1e-05)[source]

Bases: object

A learning rate scheduler similar to Noam, but modified: the warmup period is kept, while the decay rate is made tunable. The decay is exponential, down to a given minimum rate.

load_state_dict(state_dict)[source]

Given a state_dict, this function loads scheduler’s state

state_dict()[source]

Returns dictionary of values necessary to reconstruct scheduler

step()[source]

Update parameters and rate
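
One possible realization of this behaviour (linear warmup to peak_rate, then exponential decay floored at min_rate); parameter names mirror the constructor above, but the exact formula used by the class may differ:

    def warmup_exp_decay_rate(step: int, peak_rate: float = 1e-3, decay_length: int = 10000,
                              warmup: int = 4000, decay_rate: float = 0.5,
                              min_rate: float = 1e-5) -> float:
        if step < warmup:
            # linear warmup towards the peak rate
            return peak_rate * step / max(warmup, 1)
        # exponential decay after warmup, never below min_rate
        exceed = step - warmup
        return max(peak_rate * decay_rate ** (exceed / decay_length), min_rate)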

joeynmt.builders.build_gradient_clipper(config: dict) → Optional[Callable][source]

Define the function for gradient clipping as specified in configuration. If not specified, returns None.

Current options:
  • “clip_grad_val”: clip the gradients if they exceed this value,
    see torch.nn.utils.clip_grad_value_
  • “clip_grad_norm”: clip the gradients if their norm exceeds this value,
    see torch.nn.utils.clip_grad_norm_
Parameters:config – dictionary with training configurations
Returns:clipping function (in-place) or None if no gradient clipping
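
A sketch of how such a clipping callable could be assembled from the two torch utilities named above (illustrative, using only the documented config keys):

    from functools import partial
    from typing import Callable, Optional

    import torch

    def build_gradient_clipper_sketch(config: dict) -> Optional[Callable]:
        if "clip_grad_val" in config:
            return partial(torch.nn.utils.clip_grad_value_, clip_value=config["clip_grad_val"])
        if "clip_grad_norm" in config:
            return partial(torch.nn.utils.clip_grad_norm_, max_norm=config["clip_grad_norm"])
        return None

    # clip_fn = build_gradient_clipper_sketch({"clip_grad_norm": 1.0})
    # if clip_fn is not None:
    #     clip_fn(model.parameters())   # call after loss.backward(), before optimizer.step()
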
joeynmt.builders.build_optimizer(config: dict, parameters: Generator) → torch.optim.optimizer.Optimizer[source]

Create an optimizer for the given parameters as specified in config.

Except for the weight decay and initial learning rate, default optimizer settings are used.

Currently supported configuration settings for “optimizer”:
  • “sgd” (default): see torch.optim.SGD
  • “adam”: see torch.optim.Adam
  • “adagrad”: see torch.optim.Adagrad
  • “adadelta”: see torch.optim.Adadelta
  • “rmsprop”: see torch.optim.RMSprop

The initial learning rate is set according to “learning_rate” in the config. The weight decay is set according to “weight_decay” in the config. If they are not specified, the initial learning rate is set to 3.0e-4, the weight decay to 0.

Note that the scheduler state is saved in the checkpoint, so if you load a model for further training you have to use the same type of scheduler.

Parameters:
  • config – configuration dictionary
  • parameters
Returns:

optimizer
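
A hedged usage sketch based on the config keys documented above; the stand-in module only supplies the parameters argument:

    import torch.nn as nn
    from joeynmt.builders import build_optimizer

    train_cfg = {"optimizer": "adam", "learning_rate": 3.0e-4, "weight_decay": 0.0}

    model = nn.Linear(8, 8)   # stand-in for a joeynmt Model
    optimizer = build_optimizer(config=train_cfg, parameters=model.parameters())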

joeynmt.builders.build_scheduler(config: dict, optimizer: torch.optim.optimizer.Optimizer, scheduler_mode: str, hidden_size: int = 0) -> (typing.Union[torch.optim.lr_scheduler._LRScheduler, NoneType], typing.Union[str, NoneType])[source]

Create a learning rate scheduler if specified in config and determine when a scheduler step should be executed.

Current options:
  • “plateau”: see torch.optim.lr_scheduler.ReduceLROnPlateau
  • “decaying”: see torch.optim.lr_scheduler.StepLR
  • “exponential”: see torch.optim.lr_scheduler.ExponentialLR
  • “noam”: see joeynmt.builders.NoamScheduler
  • “warmupexponentialdecay”: see joeynmt.builders.WarmupExponentialDecayScheduler

If no scheduler is specified, returns (None, None) which will result in a constant learning rate.

Parameters:
  • config – training configuration
  • optimizer – optimizer for the scheduler, determines the set of parameters which the scheduler sets the learning rate for
  • scheduler_mode – “min” or “max”, depending on whether the validation score should be minimized or maximized. Only relevant for “plateau”.
  • hidden_size – encoder hidden size (required for NoamScheduler)
Returns:

  • scheduler: scheduler object,
  • scheduler_step_at: either “validation” or “epoch”

joeynmt.constants module

Defining global constants

joeynmt.constants.DEFAULT_UNK_ID()

joeynmt.data module

Data module

class joeynmt.data.MonoDataset(path: str, ext: str, field: torchtext.data.field.Field, **kwargs)[source]

Bases: torchtext.data.dataset.Dataset

Defines a dataset for machine translation without targets.

static sort_key(ex)[source]
joeynmt.data.load_data(data_cfg: dict, datasets: list = None) -> (torchtext.data.dataset.Dataset, torchtext.data.dataset.Dataset, typing.Union[torchtext.data.dataset.Dataset, NoneType], <class 'joeynmt.vocabulary.Vocabulary'>, <class 'joeynmt.vocabulary.Vocabulary'>)[source]

Load train, dev and optionally test data as specified in configuration. Vocabularies are created from the training set with a limit of voc_limit tokens and a minimum token frequency of voc_min_freq (specified in the configuration dictionary).

The training data is filtered to include sentences up to max_sent_length on source and target side.

If you set random_train_subset, a random selection of this size is used from the training set instead of the full training set.

Parameters:
  • data_cfg – configuration dictionary for data (“data” part of the configuration file)
  • datasets – list of dataset names to load
Returns:

  • train_data: training dataset
  • dev_data: development dataset
  • test_data: test dataset if given, otherwise None
  • src_vocab: source vocabulary extracted from training data
  • trg_vocab: target vocabulary extracted from training data

joeynmt.data.make_data_iter(dataset: torchtext.data.dataset.Dataset, batch_size: int, batch_type: str = 'sentence', train: bool = False, shuffle: bool = False) → torchtext.data.iterator.Iterator[source]

Returns a torchtext iterator for a torchtext dataset.

Parameters:
  • dataset – torchtext dataset containing src and optionally trg
  • batch_size – size of the batches the iterator prepares
  • batch_type – measure batch size by sentence count or by token count
  • train – whether it’s training time; when turned off, bucketing, sorting within batches, and shuffling are disabled
  • shuffle – whether to shuffle the data before each epoch (no effect if set to True for testing)
Returns:

torchtext iterator
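
A usage sketch combining load_data and make_data_iter (assuming a configuration file with a valid “data” section; the pad_index passed to Batch is illustrative):

    from joeynmt.batch import Batch
    from joeynmt.data import load_data, make_data_iter
    from joeynmt.helpers import load_config

    cfg = load_config("configs/default.yaml")
    train_data, dev_data, test_data, src_vocab, trg_vocab = load_data(data_cfg=cfg["data"])

    train_iter = make_data_iter(train_data, batch_size=80, batch_type="sentence",
                                train=True, shuffle=True)
    for torch_batch in iter(train_iter):
        batch = Batch(torch_batch, pad_index=1)   # pad_index depends on the vocabulary
        break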

joeynmt.data.token_batch_size_fn(new, count, sofar)[source]

Compute batch size based on number of tokens (+padding).

joeynmt.decoders module

Various decoders

class joeynmt.decoders.Decoder[source]

Bases: torch.nn.modules.module.Module

Base decoder class

output_size

Return the output size (size of the target vocabulary)

Returns:
class joeynmt.decoders.RecurrentDecoder(rnn_type: str = 'gru', emb_size: int = 0, hidden_size: int = 0, encoder: joeynmt.encoders.Encoder = None, attention: str = 'bahdanau', num_layers: int = 1, vocab_size: int = 0, dropout: float = 0.0, emb_dropout: float = 0.0, hidden_dropout: float = 0.0, init_hidden: str = 'bridge', input_feeding: bool = True, freeze: bool = False, **kwargs)[source]

Bases: joeynmt.decoders.Decoder

A conditional RNN decoder with attention.

forward(trg_embed: torch.Tensor, encoder_output: torch.Tensor, encoder_hidden: torch.Tensor, src_mask: torch.Tensor, unroll_steps: int, hidden: torch.Tensor = None, prev_att_vector: torch.Tensor = None, **kwargs) -> (<class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>)[source]

Unroll the decoder one step at a time for unroll_steps steps. For every step, the _forward_step function is called internally.

During training, the target inputs (trg_embed) are already known for the full sequence, so the full unroll is done. In this case, hidden and prev_att_vector are None.

For inference, this function is called with one step at a time since embedded targets are the predictions from the previous time step. In this case, hidden and prev_att_vector are fed from the output of the previous call of this function (from the 2nd step on).

src_mask is needed to mask out the areas of the encoder states that should not receive any attention, which is everything after the first <eos>.

The encoder_output are the hidden states from the encoder and are used as context for the attention.

The encoder_hidden is the last encoder hidden state that is used to initialize the first hidden decoder state (when self.init_hidden_option is “bridge” or “last”).

Parameters:
  • trg_embed – embedded target inputs, shape (batch_size, trg_length, embed_size)
  • encoder_output – hidden states from the encoder, shape (batch_size, src_length, encoder.output_size)
  • encoder_hidden – last state from the encoder, shape (batch_size, encoder.output_size)
  • src_mask – mask for src states: 0s for padded areas, 1s for the rest, shape (batch_size, 1, src_length)
  • unroll_steps – number of steps to unroll the decoder RNN
  • hidden – previous decoder hidden state, if not given it’s initialized as in self.init_hidden, shape (batch_size, num_layers, hidden_size)
  • prev_att_vector – previous attentional vector, if not given it’s initialized with zeros, shape (batch_size, 1, hidden_size)
Returns:

  • outputs: shape (batch_size, unroll_steps, vocab_size),
  • hidden: last hidden state (batch_size, num_layers, hidden_size),
  • att_probs: attention probabilities
    with shape (batch_size, unroll_steps, src_length),
  • att_vectors: attentional vectors
    with shape (batch_size, unroll_steps, hidden_size)

class joeynmt.decoders.TransformerDecoder(num_layers: int = 4, num_heads: int = 8, hidden_size: int = 512, ff_size: int = 2048, dropout: float = 0.1, emb_dropout: float = 0.1, vocab_size: int = 1, freeze: bool = False, **kwargs)[source]

Bases: joeynmt.decoders.Decoder

A transformer decoder with N masked layers. Decoder layers are masked so that an attention head cannot see the future.

forward(trg_embed: torch.Tensor = None, encoder_output: torch.Tensor = None, encoder_hidden: torch.Tensor = None, src_mask: torch.Tensor = None, unroll_steps: int = None, hidden: torch.Tensor = None, trg_mask: torch.Tensor = None, **kwargs)[source]

Transformer decoder forward pass.

Parameters:
  • trg_embed – embedded targets
  • encoder_output – source representations
  • encoder_hidden – unused
  • src_mask
  • unroll_steps – unused
  • hidden – unused
  • trg_mask – to mask out target paddings. Note that a subsequent mask is applied here.
  • kwargs
Returns:

joeynmt.embeddings module

Embedding module

class joeynmt.embeddings.Embeddings(embedding_dim: int = 64, scale: bool = False, vocab_size: int = 0, padding_idx: int = 1, freeze: bool = False, **kwargs)[source]

Bases: torch.nn.modules.module.Module

Simple embeddings class

forward(x: torch.Tensor) → torch.Tensor[source]

Perform lookup for input x in the embedding table.

Parameters:x – index in the vocabulary
Returns:embedded representation for x
load_from_file(embed_path: str, vocab: joeynmt.vocabulary.Vocabulary)[source]

Load pretrained embedding weights from text file.

  • First line is expected to contain vocabulary size and dimension. The dimension has to match the model’s specified embedding size, the vocabulary size is used in logging only.
  • Each line should contain word and embedding weights separated by spaces.
  • Pretrained vocabulary items that are not part of joeynmt’s vocabulary will be ignored (not loaded from the file).
  • The initialization (specified in config[“model”][“embed_initializer”]) of joeynmt’s vocabulary items that are not part of the pretrained vocabulary will be kept (not overwritten in this function).
  • This function should be called after initialization!
Example:
    2 5
    the -0.0230 -0.0264 0.0287 0.0171 0.1403
    at -0.0395 -0.1286 0.0275 0.0254 -0.0932
Parameters:
  • embed_path – embedding weights text file
  • vocab – Vocabulary object

joeynmt.encoders module

class joeynmt.encoders.Encoder[source]

Bases: torch.nn.modules.module.Module

Base encoder class

output_size

Return the output size

Returns:
class joeynmt.encoders.RecurrentEncoder(rnn_type: str = 'gru', hidden_size: int = 1, emb_size: int = 1, num_layers: int = 1, dropout: float = 0.0, emb_dropout: float = 0.0, bidirectional: bool = True, freeze: bool = False, **kwargs)[source]

Bases: joeynmt.encoders.Encoder

Encodes a sequence of word embeddings

forward(embed_src: torch.Tensor, src_length: torch.Tensor, mask: torch.Tensor, **kwargs) -> (<class 'torch.Tensor'>, <class 'torch.Tensor'>)[source]

Applies a bidirectional RNN to a sequence of embeddings x. The input mini-batch x needs to be sorted by src length. x and mask should have the same dimensions [batch, time, dim].

Parameters:
  • embed_src – embedded src inputs, shape (batch_size, src_len, embed_size)
  • src_length – length of src inputs (counting tokens before padding), shape (batch_size)
  • mask – indicates padding areas (zeros where padding), shape (batch_size, src_len, embed_size)
Returns:

  • output: hidden states with
    shape (batch_size, max_length, directions*hidden),
  • hidden_concat: last hidden state with
    shape (batch_size, directions*hidden)

class joeynmt.encoders.TransformerEncoder(hidden_size: int = 512, ff_size: int = 2048, num_layers: int = 8, num_heads: int = 4, dropout: float = 0.1, emb_dropout: float = 0.1, freeze: bool = False, **kwargs)[source]

Bases: joeynmt.encoders.Encoder

Transformer Encoder

forward(embed_src: torch.Tensor, src_length: torch.Tensor, mask: torch.Tensor, **kwargs) -> (<class 'torch.Tensor'>, <class 'torch.Tensor'>)[source]

Pass the input (and mask) through each layer in turn. Applies a Transformer encoder to a sequence of embeddings x. The input mini-batch x needs to be sorted by src length. x and mask should have the same dimensions [batch, time, dim].

Parameters:
  • embed_src – embedded src inputs, shape (batch_size, src_len, embed_size)
  • src_length – length of src inputs (counting tokens before padding), shape (batch_size)
  • mask – indicates padding areas (zeros where padding), shape (batch_size, 1, src_len)
Returns:

  • output: hidden states with
    shape (batch_size, max_length, directions*hidden),
  • hidden_concat: last hidden state with
    shape (batch_size, directions*hidden)

joeynmt.helpers module

Collection of helper functions

exception joeynmt.helpers.ConfigurationError[source]

Bases: Exception

Custom exception for misspecifications of configuration

joeynmt.helpers.bpe_postprocess(string, bpe_type='subword-nmt') → str[source]

Post-processor for BPE output. Recombines BPE-split tokens.

Parameters:
  • string
  • bpe_type – one of {“sentencepiece”, “subword-nmt”}
Returns:

post-processed string
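
An illustrative sketch of the two recombination rules (subword-nmt uses an "@@ " continuation marker, sentencepiece a "▁" word-boundary marker); not necessarily the exact library code:

    def bpe_postprocess_sketch(string: str, bpe_type: str = "subword-nmt") -> str:
        if bpe_type == "subword-nmt":
            return string.replace("@@ ", "")
        if bpe_type == "sentencepiece":
            return string.replace(" ", "").replace("▁", " ").strip()
        return string

    # bpe_postprocess_sketch("th@@ is is an ex@@ ample")  ->  "this is an example"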

joeynmt.helpers.clones(module: torch.nn.modules.module.Module, n: int) → torch.nn.modules.container.ModuleList[source]

Produce N identical layers. Transformer helper function.

Parameters:
  • module – the module to clone
  • n – clone this many times

Returns:cloned modules
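
Such a helper is typically a few lines of deep copies; a sketch:

    import copy
    import torch.nn as nn

    def clones_sketch(module: nn.Module, n: int) -> nn.ModuleList:
        # deep copies, so the cloned layers do not share parameters
        return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])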

joeynmt.helpers.freeze_params(module: torch.nn.modules.module.Module) → None[source]

Freeze the parameters of this module, i.e. do not update them during training

Parameters:module – freeze parameters of this module
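
A minimal sketch of what freezing amounts to:

    import torch.nn as nn

    def freeze_params_sketch(module: nn.Module) -> None:
        for param in module.parameters():
            param.requires_grad = False   # excluded from gradient updates
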
joeynmt.helpers.get_latest_checkpoint(ckpt_dir: str) → Optional[str][source]

Returns the latest checkpoint (by time) from the given directory. If there is no checkpoint in this directory, returns None

Parameters:ckpt_dir
Returns:latest checkpoint file
joeynmt.helpers.latest_checkpoint_update(target: pathlib.Path, link_name: str) → Optional[pathlib.Path][source]

This function finds the file that the symlink currently points to, sets it to the new target, and returns the previous target if it exists.

Parameters:
  • target – A path to a file that we want the symlink to point to.
  • link_name – This is the name of the symlink that we want to update.
Returns:

  • current_last: This is the previous target of the symlink, before it is
    updated in this function. If the symlink did not exist before or did not have a target, None is returned instead.

joeynmt.helpers.load_checkpoint(path: str, use_cuda: bool = True) → dict[source]

Load model from saved checkpoint.

Parameters:
  • path – path to checkpoint
  • use_cuda – using cuda or not
Returns:

checkpoint (dict)

joeynmt.helpers.load_config(path='configs/default.yaml') → dict[source]

Loads and parses a YAML configuration file.

Parameters:path – path to YAML configuration file
Returns:configuration dictionary
joeynmt.helpers.log_cfg(cfg: dict, prefix: str = 'cfg') → None[source]

Write configuration to log.

Parameters:
  • cfg – configuration to log
  • prefix – prefix for logging
joeynmt.helpers.log_data_info(train_data: torchtext.data.dataset.Dataset, valid_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset, src_vocab: joeynmt.vocabulary.Vocabulary, trg_vocab: joeynmt.vocabulary.Vocabulary) → None[source]

Log statistics of data and vocabulary.

Parameters:
  • train_data
  • valid_data
  • test_data
  • src_vocab
  • trg_vocab
joeynmt.helpers.make_logger(log_dir: str = None, mode: str = 'train') → str[source]

Create a logger for logging the training/testing process.

Parameters:
  • log_dir – path to file where log is stored as well
  • mode – log file name. ‘train’, ‘test’ or ‘translate’
Returns:

joeynmt version number

joeynmt.helpers.make_model_dir(model_dir: str, overwrite=False) → str[source]

Create a new directory for the model.

Parameters:
  • model_dir – path to model directory
  • overwrite – whether to overwrite an existing directory
Returns:

path to model directory

joeynmt.helpers.set_seed(seed: int) → None[source]

Set the random seed for modules torch, numpy and random.

Parameters:seed – random seed
joeynmt.helpers.store_attention_plots(attentions: numpy.array, targets: List[List[str]], sources: List[List[str]], output_prefix: str, indices: List[int], tb_writer: Optional[torch.utils.tensorboard.writer.SummaryWriter] = None, steps: int = 0) → None[source]

Saves attention plots.

Parameters:
  • attentions – attention scores
  • targets – list of tokenized targets
  • sources – list of tokenized sources
  • output_prefix – prefix for attention plots
  • indices – indices selected for plotting
  • tb_writer – Tensorboard summary writer (optional)
  • steps – current training steps, needed for tb_writer
  • dpi – resolution for images
joeynmt.helpers.subsequent_mask(size: int) → torch.Tensor[source]

Mask out subsequent positions (to prevent attending to future positions). Transformer helper function.

Parameters:size – size of mask (2nd and 3rd dim)
Returns:Tensor with 0s and 1s of shape (1, size, size)
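
A sketch producing the documented shape (1, size, size), where position i may attend to positions <= i:

    import torch

    def subsequent_mask_sketch(size: int) -> torch.Tensor:
        return torch.tril(torch.ones(1, size, size, dtype=torch.uint8))

    # subsequent_mask_sketch(3):
    # [[[1, 0, 0],
    #   [1, 1, 0],
    #   [1, 1, 1]]]
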
joeynmt.helpers.tile(x: torch.Tensor, count: int, dim=0) → torch.Tensor[source]

Tiles x on dimension dim count times. From OpenNMT. Used for beam search.

Parameters:
  • x – tensor to tile
  • count – number of tiles
  • dim – dimension along which the tensor is tiled
Returns:

tiled tensor
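
The effect (each slice along dim repeated count times in a row, e.g. once per beam) can be illustrated with repeat_interleave; this is equivalent in outcome but not the OpenNMT implementation:

    import torch

    def tile_sketch(x: torch.Tensor, count: int, dim: int = 0) -> torch.Tensor:
        return torch.repeat_interleave(x, count, dim=dim)

    # tile_sketch(torch.tensor([[1, 2], [3, 4]]), count=3, dim=0)
    # -> [[1, 2], [1, 2], [1, 2], [3, 4], [3, 4], [3, 4]]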

joeynmt.initialization module

Implements custom initialization

joeynmt.initialization.initialize_model(model: torch.nn.modules.module.Module, cfg: dict, src_padding_idx: int, trg_padding_idx: int) → None[source]

This initializes a model based on the provided config.

All initializer configuration is part of the model section of the configuration file. For an example, see e.g. https://github.com/joeynmt/joeynmt/blob/master/configs/iwslt_envi_xnmt.yaml#L47

The main initializer is set using the initializer key. Possible values are xavier, uniform, normal or zeros. (xavier is the default).

When an initializer is set to uniform, then init_weight sets the range for the values (-init_weight, init_weight).

When an initializer is set to normal, then init_weight sets the standard deviation for the weights (with mean 0).

The word embedding initializer is set using embed_initializer and takes the same values. The default is normal with embed_init_weight = 0.01.

Biases are initialized separately using bias_initializer. The default is zeros, but you can use the same initializers as the main initializer.

Set init_rnn_orthogonal to True if you want RNN orthogonal initialization (for recurrent matrices). Default is False.

lstm_forget_gate controls how the LSTM forget gate is initialized. Default is 1.

Parameters:
  • model – model to initialize
  • cfg – the model configuration
  • src_padding_idx – index of source padding token
  • trg_padding_idx – index of target padding token
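
A model section using the keys described above could look as follows (shown as a Python dict; all values are illustrative, and encoder/decoder/embedding sub-sections are omitted):

    model_cfg = {
        "initializer": "xavier",         # main initializer: xavier / uniform / normal / zeros
        "init_weight": 0.01,             # range for "uniform", std-dev for "normal"
        "embed_initializer": "normal",   # word embedding initializer
        "embed_init_weight": 0.01,
        "bias_initializer": "zeros",
        "init_rnn_orthogonal": False,    # orthogonal init of recurrent matrices
        "lstm_forget_gate": 1.0,         # value used by lstm_forget_gate_init_
    }
    # initialize_model(model, model_cfg, src_padding_idx=1, trg_padding_idx=1)
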
joeynmt.initialization.lstm_forget_gate_init_(cell: torch.nn.modules.rnn.RNNBase, value: float = 1.0) → None[source]

Initialize LSTM forget gates with value.

Parameters:
  • cell – LSTM cell
  • value – initial value, default: 1
joeynmt.initialization.orthogonal_rnn_init_(cell: torch.nn.modules.rnn.RNNBase, gain: float = 1.0)[source]

Orthogonal initialization of recurrent weights. RNN parameters contain 3 or 4 matrices in one parameter, so we slice it.

joeynmt.initialization.xavier_uniform_n_(w: torch.Tensor, gain: float = 1.0, n: int = 4) → None[source]

Xavier initializer for parameters that combine multiple matrices in one parameter for efficiency. This is used e.g. for GRU and LSTM parameters, where all gates are computed at the same time by one big matrix.

Parameters:
  • w – parameter
  • gain – default 1
  • n – default 4
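
A sketch of the per-gate initialization this function describes (the real implementation may differ in details):

    import torch
    import torch.nn as nn

    def xavier_uniform_n_sketch(w: torch.Tensor, gain: float = 1.0, n: int = 4) -> None:
        # w stacks n gate matrices along dim 0 (e.g. 4 for an LSTM);
        # initialize each slice separately so fan-in/fan-out refer to a single gate
        with torch.no_grad():
            for gate_w in w.chunk(n, dim=0):
                nn.init.xavier_uniform_(gate_w, gain=gain)

    # w = torch.empty(4 * 32, 16)    # like an LSTM weight_ih_l0 with hidden_size=32
    # xavier_uniform_n_sketch(w)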

joeynmt.metrics module

This module holds various MT evaluation metrics.

joeynmt.metrics.bleu(hypotheses, references, tokenize='13a')[source]

Raw corpus BLEU from sacrebleu (without tokenization)

Parameters:
  • hypotheses – list of hypotheses (strings)
  • references – list of references (strings)
  • tokenize – one of {‘none’, ‘13a’, ‘intl’, ‘zh’, ‘ja-mecab’}
Returns:

joeynmt.metrics.chrf(hypotheses, references, remove_whitespace=True)[source]

Character F-score from sacrebleu

Parameters:
  • hypotheses – list of hypotheses (strings)
  • references – list of references (strings)
  • remove_whitespace – (bool)
Returns:

joeynmt.metrics.sequence_accuracy(hypotheses, references)[source]

Compute the accuracy of hypothesis tokens: correct tokens / all tokens. Tokens are correct if they appear in the same position in the reference.

Parameters:
  • hypotheses – list of hypotheses (strings)
  • references – list of references (strings)
Returns:

joeynmt.metrics.token_accuracy(hypotheses: List[List[str]], references: List[List[str]]) → float[source]

Compute the accuracy of hypothesis tokens: correct tokens / all tokens. Tokens are correct if they appear in the same position in the reference.

Parameters:
  • hypotheses – list of tokenized hypotheses (List[List[str]])
  • references – list of tokenized references (List[List[str]])
Returns:

token accuracy (float)
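
A sketch of the computation (whether the library returns a fraction or a percentage is not specified here):

    from typing import List

    def token_accuracy_sketch(hypotheses: List[List[str]], references: List[List[str]]) -> float:
        correct, total = 0, 0
        for hyp, ref in zip(hypotheses, references):
            total += len(hyp)
            correct += sum(h == r for h, r in zip(hyp, ref))
        return correct / total if total > 0 else 0.0

    # token_accuracy_sketch([["a", "b", "c"]], [["a", "x", "c"]])  ->  0.66...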

joeynmt.model module

Module to represent whole models

class joeynmt.model.Model(encoder: joeynmt.encoders.Encoder, decoder: joeynmt.decoders.Decoder, src_embed: joeynmt.embeddings.Embeddings, trg_embed: joeynmt.embeddings.Embeddings, src_vocab: joeynmt.vocabulary.Vocabulary, trg_vocab: joeynmt.vocabulary.Vocabulary)[source]

Bases: torch.nn.modules.module.Module

Base Model class

forward(return_type: str = None, **kwargs) -> (<class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>)[source]

Interface for multi-gpu

For DataParallel, all model calls (model.encode(), model.decode(), and model.encode_decode()) need to be wrapped by model.__call__(). model.__call__() triggers model.forward() together with its pre hooks and post hooks, which take care of multi-GPU distribution.

Parameters:return_type – one of {“loss”, “encode”, “decode”}
loss_function
joeynmt.model.build_model(cfg: dict = None, src_vocab: joeynmt.vocabulary.Vocabulary = None, trg_vocab: joeynmt.vocabulary.Vocabulary = None) → joeynmt.model.Model[source]

Build and initialize the model according to the configuration.

Parameters:
  • cfg – dictionary configuration containing model specifications
  • src_vocab – source vocabulary
  • trg_vocab – target vocabulary
Returns:

built and initialized model
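
A hedged usage sketch, assuming the default config path of load_config and a configuration with “data” and “model” sections:

    from joeynmt.data import load_data
    from joeynmt.helpers import load_config
    from joeynmt.model import build_model

    cfg = load_config("configs/default.yaml")
    _, _, _, src_vocab, trg_vocab = load_data(data_cfg=cfg["data"])
    model = build_model(cfg["model"], src_vocab=src_vocab, trg_vocab=trg_vocab)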

joeynmt.plotting module

joeynmt.plotting.plot_heatmap(scores: numpy.array, column_labels: List[str], row_labels: List[str], output_path: Optional[str] = None, dpi: int = 300) → matplotlib.figure.Figure[source]

Plotting function that can be used to visualize (self-)attention. Plots are saved if output_path is specified, in the format given by that file’s extension (‘pdf’ or ‘png’).

Parameters:
  • scores – attention scores
  • column_labels – labels for columns (e.g. target tokens)
  • row_labels – labels for rows (e.g. source tokens)
  • output_path – path to save to
  • dpi – set resolution for matplotlib
Returns:

pyplot figure

joeynmt.prediction module

This modules holds methods for generating predictions from a model.

joeynmt.prediction.parse_test_args(cfg, mode='test')[source]

Parse test arguments.

Parameters:
  • cfg – config object
  • mode – ‘test’ or ‘translate’
Returns:

joeynmt.prediction.test(cfg_file, ckpt: str, batch_class: joeynmt.batch.Batch = <class 'joeynmt.batch.Batch'>, output_path: str = None, save_attention: bool = False, datasets: dict = None) → None[source]

Main test function. Handles loading a model from checkpoint, generating translations and storing them and attention plots.

Parameters:
  • cfg_file – path to configuration file
  • ckpt – path to checkpoint to load
  • batch_class – class type of batch
  • output_path – path to output
  • datasets – datasets to predict
  • save_attention – whether to save the computed attention weights
joeynmt.prediction.translate(cfg_file: str, ckpt: str, output_path: str = None, batch_class: joeynmt.batch.Batch = <class 'joeynmt.batch.Batch'>, n_best: int = 1) → None[source]

Interactive translation function. Loads a model from checkpoint and translates either input read from stdin or, in interactive mode, input entered at a prompt. The input has to be pre-processed according to the data the model was trained on, i.e. tokenized or split into subwords. Translations are printed to stdout.

Parameters:
  • cfg_file – path to configuration file
  • ckpt – path to checkpoint to load
  • output_path – path to output file
  • batch_class – class type of batch
  • n_best – amount of candidates to display
joeynmt.prediction.validate_on_data(model: joeynmt.model.Model, data: torchtext.data.dataset.Dataset, batch_size: int, use_cuda: bool, max_output_length: int, level: str, eval_metric: Optional[str], n_gpu: int, batch_class: joeynmt.batch.Batch = <class 'joeynmt.batch.Batch'>, compute_loss: bool = False, beam_size: int = 1, beam_alpha: int = -1, batch_type: str = 'sentence', postprocess: bool = True, bpe_type: str = 'subword-nmt', sacrebleu: dict = None, n_best: int = 1) -> (<class 'float'>, <class 'float'>, <class 'float'>, typing.List[str], typing.List[typing.List[str]], typing.List[str], typing.List[str], typing.List[typing.List[str]], typing.List[<built-in function array>])[source]

Generate translations for the given data. If compute_loss is True and references are given, also compute the loss.

Parameters:
  • model – model module
  • data – dataset for validation
  • batch_size – validation batch size
  • batch_class – class type of batch
  • use_cuda – if True, use CUDA
  • max_output_length – maximum length for generated hypotheses
  • level – segmentation level, one of “char”, “bpe”, “word”
  • eval_metric – evaluation metric, e.g. “bleu”
  • n_gpu – number of GPUs
  • compute_loss – whether to compute a scalar loss for given inputs and targets
  • beam_size – beam size for validation. If <2 then greedy decoding (default).
  • beam_alpha – beam search alpha for length penalty, disabled if set to -1 (default).
  • batch_type – validation batch type (sentence or token)
  • postprocess – if True, remove BPE segmentation from translations
  • bpe_type – bpe type, one of {“subword-nmt”, “sentencepiece”}
  • sacrebleu – sacrebleu options
  • n_best – number of candidates to return
Returns:

  • current_valid_score: current validation score [eval_metric],
  • valid_loss: validation loss,
  • valid_ppl: validation perplexity,
  • valid_sources: validation sources,
  • valid_sources_raw: raw validation sources (before post-processing),
  • valid_references: validation references,
  • valid_hypotheses: validation_hypotheses,
  • decoded_valid: raw validation hypotheses (before post-processing),
  • valid_attention_scores: attention scores for validation hypotheses

joeynmt.search module

joeynmt.search.greedy(src_mask: torch.Tensor, max_output_length: int, model: joeynmt.model.Model, encoder_output: torch.Tensor, encoder_hidden: torch.Tensor) -> (<built-in function array>, <built-in function array>)[source]

Greedy decoding. Select the token with the highest probability at each time step. This function is a wrapper that calls recurrent_greedy for recurrent decoders and transformer_greedy for transformer decoders.

Parameters:
  • src_mask – mask for source inputs, 0 for positions after </s>
  • max_output_length – maximum length for the hypotheses
  • model – model to use for greedy decoding
  • encoder_output – encoder hidden states for attention
  • encoder_hidden – encoder last state for decoder initialization
Returns:

joeynmt.search.transformer_greedy(src_mask: torch.Tensor, max_output_length: int, model: joeynmt.model.Model, encoder_output: torch.Tensor, encoder_hidden: torch.Tensor) -> (<built-in function array>, None)[source]

Special greedy function for transformer, since it works differently. The transformer remembers all previous states and attends to them.

Parameters:
  • src_mask – mask for source inputs, 0 for positions after </s>
  • max_output_length – maximum length for the hypotheses
  • model – model to use for greedy decoding
  • encoder_output – encoder hidden states for attention
  • encoder_hidden – encoder final state (unused in Transformer)
Returns:

  • stacked_output: output hypotheses (2d array of indices),
  • stacked_attention_scores: attention scores (3d array)

joeynmt.search.beam_search(model, size, encoder_output, encoder_hidden, src_mask, max_output_length, alpha, n_best=1)[source]

Beam search with size k. Inspired by OpenNMT-py, adapted for Transformer. In each decoding step, find the k most likely partial hypotheses.

Parameters:
  • model
  • size – size of the beam
  • encoder_output
  • encoder_hidden
  • src_mask
  • max_output_length
  • alpha – alpha factor for length penalty
  • n_best – return this many hypotheses, <= beam (currently only 1)
Returns:

  • stacked_output: output hypotheses (2d array of indices),
  • stacked_attention_scores: attention scores (3d array)
joeynmt.search.run_batch(model: joeynmt.model.Model, batch: joeynmt.batch.Batch, max_output_length: int, beam_size: int, beam_alpha: float, n_best: int = 1) -> (<built-in function array>, <built-in function array>)[source]

Get outputs and attention scores for a given batch

Parameters:
  • model – Model class
  • batch – batch to generate hypotheses for
  • max_output_length – maximum length of hypotheses
  • beam_size – size of the beam for beam search, if 0 use greedy
  • beam_alpha – alpha value for beam search
  • n_best – candidates to return
Returns:

stacked_output: hypotheses for batch, stacked_attention_scores: attention scores for batch

joeynmt.training module

Training module

class joeynmt.training.TrainManager(model: joeynmt.model.Model, config: dict, batch_class: joeynmt.batch.Batch = <class 'joeynmt.batch.Batch'>)[source]

Bases: object

Manages training loop, validations, learning rate scheduling and early stopping.

class TrainStatistics(steps: int = 0, stop: bool = False, total_tokens: int = 0, best_ckpt_iter: int = 0, best_ckpt_score: float = inf, minimize_metric: bool = True)[source]

Bases: object

is_best(score)[source]
init_from_checkpoint(path: str, reset_best_ckpt: bool = False, reset_scheduler: bool = False, reset_optimizer: bool = False, reset_iter_state: bool = False) → None[source]

Initialize the trainer from a given checkpoint file.

This checkpoint file contains not only model parameters, but also scheduler and optimizer states, see self._save_checkpoint.

Parameters:
  • path – path to checkpoint
  • reset_best_ckpt – reset tracking of the best checkpoint, use for domain adaptation with a new dev set or when using a new metric for fine-tuning.
  • reset_scheduler – reset the learning rate scheduler, and do not use the one stored in the checkpoint.
  • reset_optimizer – reset the optimizer, and do not use the one stored in the checkpoint.
  • reset_iter_state – reset the sampler’s internal state and do not use the one stored in the checkpoint.
train_and_validate(train_data: torchtext.data.dataset.Dataset, valid_data: torchtext.data.dataset.Dataset) → None[source]

Train the model and validate it from time to time on the validation set.

Parameters:
  • train_data – training data
  • valid_data – validation data
joeynmt.training.train(cfg_file: str) → None[source]

Main training function. After training, also test on test data if given.

Parameters:cfg_file – path to configuration yaml file

joeynmt.vocabulary module

Vocabulary module

class joeynmt.vocabulary.Vocabulary(tokens: List[str] = None, file: str = None)[source]

Bases: object

Vocabulary represents a mapping between tokens and indices.

add_tokens(tokens: List[str]) → None[source]

Add list of tokens to vocabulary

Parameters:tokens – list of tokens to add to the vocabulary
array_to_sentence(array: numpy.array, cut_at_eos=True, skip_pad=True) → List[str][source]

Converts an array of IDs to a sentence, optionally cutting the result off at the end-of-sequence token.

Parameters:
  • array – 1D array containing indices
  • cut_at_eos – cut the decoded sentences at the first <eos>
  • skip_pad – skip generated <pad> tokens
Returns:

list of strings (tokens)

arrays_to_sentences(arrays: numpy.array, cut_at_eos=True, skip_pad=True) → List[List[str]][source]

Convert multiple arrays containing sequences of token IDs to their sentences, optionally cutting them off at the end-of-sequence token.

Parameters:
  • arrays – 2D array containing indices
  • cut_at_eos – cut the decoded sentences at the first <eos>
  • skip_pad – skip generated <pad> tokens
Returns:

list of list of strings (tokens)

is_unk(token: str) → bool[source]

Check whether a token is covered by the vocabulary

Parameters:token
Returns:True if covered, False otherwise
to_file(file: str) → None[source]

Save the vocabulary to a file, by writing token with index i in line i.

Parameters:file – path to file where the vocabulary is written
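
A small usage sketch of the methods above (assuming the constructor adds the special symbols such as <unk> itself):

    from joeynmt.vocabulary import Vocabulary

    vocab = Vocabulary(tokens=["the", "cat", "sat"])
    vocab.add_tokens(["mat"])
    assert not vocab.is_unk("cat")
    assert vocab.is_unk("dog")
    vocab.to_file("vocab.txt")   # one token per line, line i holds the token with index i
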
joeynmt.vocabulary.build_vocab(field: str, max_size: int, min_freq: int, dataset: torchtext.data.dataset.Dataset, vocab_file: str = None) → joeynmt.vocabulary.Vocabulary[source]

Builds a vocabulary for a torchtext field from the given dataset or vocab_file.

Parameters:
  • field – attribute e.g. “src”
  • max_size – maximum size of vocabulary
  • min_freq – minimum frequency for an item to be included
  • dataset – dataset to load data for field from
  • vocab_file – if given, load the vocabulary from this file instead of building it from the dataset
Returns:

Vocabulary created from either dataset or vocab_file

joeynmt.loss module

Module to implement training loss

class joeynmt.loss.XentLoss(pad_index: int, smoothing: float = 0.0)[source]

Bases: torch.nn.modules.module.Module

Cross-Entropy Loss with optional label smoothing

forward(log_probs, targets)[source]

Compute the cross-entropy between the predicted log probabilities and the targets.

If label smoothing is used, target distributions are not one-hot, but “1-smoothing” for the correct target token and the rest of the probability mass is uniformly spread across the other tokens.

Parameters:
  • log_probs – log probabilities as predicted by model
  • targets – target indices
Returns:
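
An illustrative construction of such a smoothed target distribution (one common formulation, dividing the remaining mass over the non-gold, non-padding tokens; not the class code):

    import torch

    def smoothed_targets(targets: torch.Tensor, vocab_size: int,
                         pad_index: int, smoothing: float = 0.1) -> torch.Tensor:
        # 1 - smoothing on the gold token, the rest spread uniformly; padding rows zeroed
        dist = torch.full((targets.size(0), vocab_size), smoothing / (vocab_size - 2))
        dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
        dist[:, pad_index] = 0.0
        dist[targets == pad_index] = 0.0
        return dist

    # targets = torch.tensor([4, 2]); dist = smoothed_targets(targets, vocab_size=8, pad_index=1)
    # the loss is then e.g. torch.nn.functional.kl_div(log_probs, dist, reduction="sum")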

joeynmt.transformer_layers module

class joeynmt.transformer_layers.MultiHeadedAttention(num_heads: int, size: int, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

Multi-Head Attention module from “Attention is All You Need”

Implementation modified from OpenNMT-py. https://github.com/OpenNMT/OpenNMT-py

forward(k: torch.Tensor, v: torch.Tensor, q: torch.Tensor, mask: torch.Tensor = None)[source]

Computes multi-headed attention.

Parameters:
  • k – keys [B, M, D] with M being the sentence length.
  • v – values [B, M, D]
  • q – query [B, M, D]
  • mask – optional mask [B, 1, M]
Returns:

class joeynmt.transformer_layers.PositionalEncoding(size: int = 0, max_len: int = 5000)[source]

Bases: torch.nn.modules.module.Module

Pre-compute position encodings (PE). In forward pass, this adds the position-encodings to the input for as many time steps as necessary.

Implementation based on OpenNMT-py. https://github.com/OpenNMT/OpenNMT-py

forward(emb)[source]

Add the pre-computed position encodings to the input embeddings.

Parameters:emb – sequence of word vectors (FloatTensor), shape (seq_len, batch_size, self.dim)
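
An illustrative pre-computation of the sinusoidal table (assumes an even size; the class code may differ in detail):

    import math
    import torch

    def positional_encoding_table(size: int, max_len: int = 5000) -> torch.Tensor:
        # pe[pos, 2i] = sin(pos / 10000^(2i/size)),  pe[pos, 2i+1] = cos(pos / 10000^(2i/size))
        pe = torch.zeros(max_len, size)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, size, 2, dtype=torch.float)
                             * -(math.log(10000.0) / size))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe   # (max_len, size); a slice of length seq_len is added to the embeddings
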
class joeynmt.transformer_layers.PositionwiseFeedForward(input_size, ff_size, dropout=0.1)[source]

Bases: torch.nn.modules.module.Module

Position-wise feed-forward layer. Projects to ff_size and then back down to input_size.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class joeynmt.transformer_layers.TransformerDecoderLayer(size: int = 0, ff_size: int = 0, num_heads: int = 0, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

Transformer decoder layer.

Consists of self-attention, source-attention, and feed-forward.

forward(x: torch.Tensor = None, memory: torch.Tensor = None, src_mask: torch.Tensor = None, trg_mask: torch.Tensor = None) → torch.Tensor[source]

Forward pass of a single Transformer decoder layer.

Parameters:
  • x – inputs
  • memory – source representations
  • src_mask – source mask
  • trg_mask – target mask (so as to not condition on future steps)
Returns:

output tensor

class joeynmt.transformer_layers.TransformerEncoderLayer(size: int = 0, ff_size: int = 0, num_heads: int = 0, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

One Transformer encoder layer has a Multi-head attention layer plus a position-wise feed-forward layer.

forward(x: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]

Forward pass for a single transformer encoder layer. First applies layer norm, then self attention, then dropout with residual connection (adding the input to the result), and then a position-wise feed-forward layer.

Parameters:
  • x – layer input
  • mask – input mask
Returns:

output tensor
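
A sketch of the operation order described above, built from generic PyTorch modules (requires a PyTorch version with batch_first in nn.MultiheadAttention; not the joeynmt implementation, and the padding-mask convention here is the inverse of joeynmt's 1 = valid):

    import torch
    import torch.nn as nn

    class PreNormEncoderLayerSketch(nn.Module):
        def __init__(self, size: int = 512, num_heads: int = 8,
                     ff_size: int = 2048, dropout: float = 0.1):
            super().__init__()
            self.layer_norm = nn.LayerNorm(size)
            self.self_attn = nn.MultiheadAttention(size, num_heads,
                                                   dropout=dropout, batch_first=True)
            self.dropout = nn.Dropout(dropout)
            self.feed_forward = nn.Sequential(
                nn.LayerNorm(size),
                nn.Linear(size, ff_size), nn.ReLU(), nn.Dropout(dropout),
                nn.Linear(ff_size, size), nn.Dropout(dropout),
            )

        def forward(self, x: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
            # layer norm -> self-attention -> dropout + residual
            h = self.layer_norm(x)
            attn_out, _ = self.self_attn(h, h, h, key_padding_mask=pad_mask)  # True = padded
            x = x + self.dropout(attn_out)
            # position-wise feed-forward with its own residual
            return x + self.feed_forward(x)

    # x = torch.randn(2, 7, 512); pad_mask = torch.zeros(2, 7, dtype=torch.bool)
    # out = PreNormEncoderLayerSketch()(x, pad_mask)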

Module contents