API Documentation#

Module contents#

Submodules#

joeynmt.attention module#

Attention modules

class joeynmt.attention.AttentionMechanism(*args, **kwargs)[source]#

Bases: Module

Base attention class

forward(*inputs)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class joeynmt.attention.BahdanauAttention(hidden_size: int = 1, key_size: int = 1, query_size: int = 1)[source]#

Bases: AttentionMechanism

Implements Bahdanau (MLP) attention

Section A.1.2 in https://arxiv.org/abs/1409.0473.

compute_proj_keys(keys: Tensor) None[source]#

Compute the projection of the keys. Pre-computing this once before receiving the individual queries makes the per-query computation cheaper.

Parameters:

keys

Returns:

compute_proj_query(query: Tensor)[source]#

Compute the projection of the query.

Parameters:

query

Returns:

forward(query: Tensor, mask: Tensor, values: Tensor) Tuple[Tensor, Tensor][source]#

Bahdanau MLP attention forward pass.

Parameters:
  • query – the item (decoder state) to compare with the keys/memory, shape (batch_size, 1, decoder.hidden_size)

  • mask – mask out keys position (0 in invalid positions, 1 else), shape (batch_size, 1, src_length)

  • values – values (encoder states), shape (batch_size, src_length, encoder.hidden_size)

Returns:

  • context vector of shape (batch_size, 1, value_size),

  • attention probabilities of shape (batch_size, 1, src_length)

class joeynmt.attention.LuongAttention(hidden_size: int = 1, key_size: int = 1)[source]#

Bases: AttentionMechanism

Implements Luong (bilinear / multiplicative) attention.

Eq. 8 (“general”) in http://aclweb.org/anthology/D15-1166.

compute_proj_keys(keys: Tensor) None[source]#

Compute the projection of the keys and assign them to self.proj_keys. This pre-computation is efficiently done for all keys before receiving individual queries.

Parameters:

keys – shape (batch_size, src_length, encoder.hidden_size)

forward(query: Tensor, mask: Tensor, values: Tensor) Tuple[Tensor, Tensor][source]#

Luong (multiplicative / bilinear) attention forward pass. Computes context vectors and attention scores for a given query and all masked values and returns them.

Parameters:
  • query – the item (decoder state) to compare with the keys/memory, shape (batch_size, 1, decoder.hidden_size)

  • mask – mask out keys position (0 in invalid positions, 1 else), shape (batch_size, 1, src_length)

  • values – values (encoder states), shape (batch_size, src_length, encoder.hidden_size)

Returns:

  • context vector of shape (batch_size, 1, value_size),

  • attention probabilities of shape (batch_size, 1, src_length)
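
The following minimal sketch shows how these attention modules are typically driven: project the keys once per batch, then query the module step by step. The sizes, random tensors, and the choice of LuongAttention here are illustrative only, not library defaults.

import torch
from joeynmt.attention import LuongAttention

batch_size, src_length, hidden_size = 2, 7, 64            # illustrative sizes
attention = LuongAttention(hidden_size=hidden_size, key_size=hidden_size)

keys = torch.randn(batch_size, src_length, hidden_size)   # encoder states (keys == values here)
query = torch.randn(batch_size, 1, hidden_size)           # one decoder state
mask = torch.ones(batch_size, 1, src_length, dtype=torch.bool)  # 1 = valid position

attention.compute_proj_keys(keys)                          # pre-compute key projections once
context, att_probs = attention(query=query, mask=mask, values=keys)
# context: (batch_size, 1, hidden_size), att_probs: (batch_size, 1, src_length)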

joeynmt.batch module#

Implementation of a mini-batch.

class joeynmt.batch.Batch(src: Tensor, src_length: Tensor, src_prompt_mask: Tensor | None, trg: Tensor | None, trg_prompt_mask: Tensor | None, indices: Tensor, device: device, pad_index: int, eos_index: int, is_train: bool = True)[source]#

Bases: object

Object for holding a batch of data with mask during training. Input is yielded from collate_fn() called by torch.utils.data.DataLoader.

normalize(tensor: Tensor, normalization: str = 'none', n_gpu: int = 1, n_accumulation: int = 1) Tensor[source]#

Normalizes batch tensor (i.e. loss). Takes the sum over multiple GPUs, divides by nseqs or ntokens, divides by n_gpu, then divides by n_accumulation.

Parameters:
  • tensor – (Tensor) tensor to normalize, i.e. batch loss

  • normalization – (str) one of {batch, tokens, none}

  • n_gpu – (int) the number of gpus

  • n_accumulation – (int) the number of gradient accumulation steps

Returns:

normalized tensor

static score(log_probs: Tensor, trg: Tensor, pad_index: int) ndarray[source]#

Look up the score of the trg token (ground truth) in the batch

sort_by_src_length() List[int][source]#

Sort by src length (descending) and return index to revert sort

Returns:

list of indices

joeynmt.builders module#

Collection of builder functions

class joeynmt.builders.BaseScheduler(optimizer: Optimizer)[source]#

Bases: object

Base LR Scheduler decay at “step”

load_state_dict(state_dict)[source]#

Given a state_dict, this function loads scheduler’s state

state_dict()[source]#

Returns dictionary of values necessary to reconstruct scheduler

step(step)[source]#

Update parameters and rate

class joeynmt.builders.NoamScheduler(hidden_size: int, optimizer: Optimizer, factor: float = 1.0, warmup: int = 4000)[source]#

Bases: BaseScheduler

The Noam learning rate scheduler used in “Attention is all you need” See Eq. 3 in https://arxiv.org/abs/1706.03762

load_state_dict(state_dict)[source]#

Given a state_dict, this function loads scheduler’s state

state_dict()[source]#

Returns dictionary of values necessary to reconstruct scheduler
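
As a reference, here is a minimal sketch of the schedule from Eq. 3 of the paper, not the scheduler's internal code; factor and warmup mirror the constructor defaults above, and hidden_size is illustrative.

def noam_lr(step: int, hidden_size: int = 512, factor: float = 1.0, warmup: int = 4000) -> float:
    # lr grows roughly linearly during warmup, then decays with the inverse square root of the step
    return factor * hidden_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)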

class joeynmt.builders.WarmupExponentialDecayScheduler(optimizer: Optimizer, peak_rate: float = 0.001, decay_length: int = 10000, warmup: int = 4000, decay_rate: float = 0.5, min_rate: float = 1e-05)[source]#

Bases: BaseScheduler

A learning rate scheduler similar to Noam, but modified: Keep the warm up period but make it so that the decay rate can be tuneable. The decay is exponential up to a given minimum rate.

load_state_dict(state_dict)[source]#

Given a state_dict, this function loads scheduler’s state

state_dict()[source]#

Returns dictionary of values necessary to reconstruct scheduler

class joeynmt.builders.WarmupInverseSquareRootScheduler(optimizer: Optimizer, peak_rate: float = 0.001, warmup: int = 10000, min_rate: float = 1e-05)[source]#

Bases: BaseScheduler

Decay the LR based on the inverse square root of the update number. In the warmup phase, we linearly increase the learning rate. After warmup, we decrease the learning rate as follows:

decay_factor = peak_rate * sqrt(warmup)  # constant value
lr = decay_factor / sqrt(step)

cf.) https://github.com/pytorch/fairseq/blob/main/fairseq/optim/lr_scheduler/inverse_square_root_schedule.py

load_state_dict(state_dict)[source]#

Given a state_dict, this function loads scheduler’s state

state_dict()[source]#

Returns dictionary of values necessary to reconstruct scheduler

joeynmt.builders.build_activation(activation: str = 'relu') Callable[source]#

Returns the activation function

joeynmt.builders.build_gradient_clipper(cfg: Dict) Callable | None[source]#

Define the function for gradient clipping as specified in configuration. If not specified, returns None.

Current options:
  • “clip_grad_val”: clip the gradients if they exceed this value,

    see torch.nn.utils.clip_grad_value_

  • “clip_grad_norm”: clip the gradients if their norm exceeds this value,

    see torch.nn.utils.clip_grad_norm_

Parameters:

cfg – dictionary with training configurations

Returns:

clipping function (in-place) or None if no gradient clipping
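
A hedged usage sketch, assuming the returned callable accepts the parameter iterable as its first argument (as torch.nn.utils.clip_grad_value_ and clip_grad_norm_ do); the clipping values are illustrative.

clip_fn = build_gradient_clipper({"clip_grad_val": 1.0})   # or {"clip_grad_norm": 5.0}
if clip_fn is not None:
    clip_fn(model.parameters())   # clips in-place, typically right before optimizer.step()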

joeynmt.builders.build_optimizer(cfg: Dict, parameters: Generator) Optimizer[source]#

Create an optimizer for the given parameters as specified in config.

Except for the weight decay and initial learning rate, default optimizer settings are used.

Currently supported configuration settings for “optimizer”:
  • “sgd” (default): see torch.optim.SGD

  • “adam”: see torch.optim.Adam

  • “adamw”: see torch.optim.AdamW

  • “adagrad”: see torch.optim.Adagrad

  • “adadelta”: see torch.optim.Adadelta

  • “rmsprop”: see torch.optim.RMSprop

The initial learning rate is set according to “learning_rate” in the config. The weight decay is set according to “weight_decay” in the config. If they are not specified, the initial learning rate is set to 3.0e-4, the weight decay to 0.

Note that the scheduler state is saved in the checkpoint, so if you load a model for further training you have to use the same type of scheduler.

Parameters:
  • cfg – configuration dictionary

  • parameters

Returns:

optimizer
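
A hedged sketch of a matching “training” config fragment; the keys are the ones named above, the values and the existing model are illustrative.

train_cfg = {
    "optimizer": "adamw",
    "learning_rate": 3.0e-4,
    "weight_decay": 0.0,
}
optimizer = build_optimizer(cfg=train_cfg, parameters=model.parameters())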

joeynmt.builders.build_scheduler(cfg: Dict, optimizer: Optimizer, scheduler_mode: str, hidden_size: int = 0)[source]#

Create a learning rate scheduler if specified in config and determine when a scheduler step should be executed.

Current options:
  • “plateau”: see torch.optim.lr_scheduler.ReduceLROnPlateau

  • “decaying”: see torch.optim.lr_scheduler.StepLR

  • “exponential”: see torch.optim.lr_scheduler.ExponentialLR

  • “noam”: see joeynmt.builders.NoamScheduler

  • “warmupexponentialdecay”: see joeynmt.builders.WarmupExponentialDecayScheduler

  • “warmupinversesquareroot”: see joeynmt.builders.WarmupInverseSquareRootScheduler

If no scheduler is specified, returns (None, None) which will result in a constant learning rate.

Parameters:
  • cfg – training configuration

  • optimizer – optimizer for the scheduler, determines the set of parameters which the scheduler sets the learning rate for

  • scheduler_mode – “min” or “max”, depending on whether the validation score should be minimized or maximized. Only relevant for “plateau”.

  • hidden_size – encoder hidden size (required for NoamScheduler)

Returns:

  • scheduler: scheduler object,

  • scheduler_step_at: either “validation”, “epoch”, “step” or “none”
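
A hedged sketch for the Noam case; the config keys follow the training options documented in this module, and the values and hidden size are illustrative.

scheduler, scheduler_step_at = build_scheduler(
    cfg={"scheduling": "noam", "learning_rate_factor": 1.0, "learning_rate_warmup": 4000},
    optimizer=optimizer,
    scheduler_mode="min",    # only relevant for "plateau"
    hidden_size=512,
)
# scheduler_step_at is expected to be "step" here, i.e. call scheduler.step(step) after each update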

joeynmt.config module#

Module for configuration

This can only be a temporary solution. TODO: Consider better configuration and validation cf. https://github.com/joeynmt/joeynmt/issues/196

class joeynmt.config.BaseConfig(name, joeynmt_version, model_dir, device, n_gpu, num_workers, autocast, seed, train, test, data, model)#

Bases: tuple

autocast: Dict#

Alias for field number 6

data: Dict#

Alias for field number 10

device: device#

Alias for field number 3

joeynmt_version: str | None#

Alias for field number 1

model: Dict#

Alias for field number 11

model_dir: Path#

Alias for field number 2

n_gpu: int#

Alias for field number 4

name: str#

Alias for field number 0

num_workers: int#

Alias for field number 5

seed: int#

Alias for field number 7

test: TestConfig#

Alias for field number 9

train: TrainConfig#

Alias for field number 8

exception joeynmt.config.ConfigurationError[source]#

Bases: Exception

Custom exception for misspecifications of configuration

class joeynmt.config.TestConfig(load_model, batch_size, batch_type, max_output_length, min_output_length, eval_metrics, sacrebleu_cfg, beam_size, beam_alpha, n_best, return_attention, return_prob, generate_unk, repetition_penalty, no_repeat_ngram_size)#

Bases: tuple

batch_size: int#

Alias for field number 1

batch_type: str#

Alias for field number 2

beam_alpha: int#

Alias for field number 8

beam_size: int#

Alias for field number 7

eval_metrics: List[str]#

Alias for field number 5

generate_unk: bool#

Alias for field number 12

load_model: Path | None#

Alias for field number 0

max_output_length: int#

Alias for field number 3

min_output_length: int#

Alias for field number 4

n_best: int#

Alias for field number 9

no_repeat_ngram_size: int#

Alias for field number 14

repetition_penalty: float#

Alias for field number 13

return_attention: bool#

Alias for field number 10

return_prob: str#

Alias for field number 11

sacrebleu_cfg: Dict | None#

Alias for field number 6

class joeynmt.config.TrainConfig(load_model, load_encoder, load_decoder, loss, normalization, label_smoothing, optimizer, adam_betas, learning_rate, learning_rate_min, learning_rate_factor, learning_rate_warmup, scheduling, patience, decrease_factor, weight_decay, clip_grad_norm, clip_grad_val, keep_best_ckpts, logging_freq, validation_freq, print_valid_sents, early_stopping_metric, minimize_metric, shuffle, epochs, max_updates, batch_size, batch_type, batch_multiplier, reset_best_ckpt, reset_scheduler, reset_optimizer, reset_iter_state)#

Bases: tuple

adam_betas: List[float]#

Alias for field number 7

batch_multiplier: int#

Alias for field number 29

batch_size: int#

Alias for field number 27

batch_type: str#

Alias for field number 28

clip_grad_norm: float | None#

Alias for field number 16

clip_grad_val: float | None#

Alias for field number 17

decrease_factor: float#

Alias for field number 14

early_stopping_metric: str#

Alias for field number 22

epochs: int#

Alias for field number 25

keep_best_ckpts: int#

Alias for field number 18

label_smoothing: float#

Alias for field number 5

learning_rate: float#

Alias for field number 8

learning_rate_factor: int#

Alias for field number 10

learning_rate_min: float#

Alias for field number 9

learning_rate_warmup: int#

Alias for field number 11

load_decoder: Path | None#

Alias for field number 2

load_encoder: Path | None#

Alias for field number 1

load_model: Path | None#

Alias for field number 0

logging_freq: int#

Alias for field number 19

loss: str#

Alias for field number 3

max_updates: int#

Alias for field number 26

minimize_metric: bool#

Alias for field number 23

normalization: str#

Alias for field number 4

optimizer: str#

Alias for field number 6

patience: int#

Alias for field number 13

print_valid_sents: List[int]#

Alias for field number 21

reset_best_ckpt: bool#

Alias for field number 30

reset_iter_state: bool#

Alias for field number 33

reset_optimizer: bool#

Alias for field number 32

reset_scheduler: bool#

Alias for field number 31

scheduling: str | None#

Alias for field number 12

shuffle: bool#

Alias for field number 24

validation_freq: int#

Alias for field number 20

weight_decay: float#

Alias for field number 15

joeynmt.config.load_config(cfg_file: str = 'configs/default.yaml') Dict[source]#

Loads and parses a YAML configuration file.

Parameters:

cfg_file – path to YAML configuration file

Returns:

configuration dictionary
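
A short usage example; the config path is illustrative (it is the example config referenced elsewhere in these docs), and the exact set of top-level sections depends on the file.

>>> from joeynmt.config import load_config
>>> cfg = load_config("configs/iwslt14_ende_spm.yaml")
>>> sorted(cfg.keys())  # typically includes "data", "model", "testing", "training"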

joeynmt.config.log_config(cfg: Dict, prefix: str = 'cfg') None[source]#

Print configuration to console log.

Parameters:
  • cfg – configuration to log

  • prefix – prefix for logging

joeynmt.config.parse_global_args(cfg: Dict = None, rank: int = 0, mode: str = 'train') BaseConfig[source]#

Parse and validate global args

Parameters:
  • cfg – config specified in yaml file

  • rank

  • mode

joeynmt.config.parse_test_args(cfg: Dict = None, mode: str = 'test') TestConfig[source]#

Parse and validate test args

Parameters:
  • cfg – testing section in config yaml

  • mode

joeynmt.config.parse_train_args(cfg: Dict = None, mode: str = 'train') TrainConfig[source]#

Parse and validate train args

Parameters:
  • cfg – training section in config yaml

  • mode

joeynmt.config.set_validation_args(args: TestConfig) TestConfig[source]#

Config for validation

Parameters:

args – testing section in config yaml

joeynmt.data module#

Data module

joeynmt.data.load_data(cfg: Dict, datasets: list = None) Tuple[Vocabulary, Vocabulary, BaseDataset | None, BaseDataset | None, BaseDataset | None][source]#

Load train, dev and optionally test data as specified in configuration. Vocabularies are created from the training set with a limit of voc_limit tokens and a minimum token frequency of voc_min_freq (specified in the configuration dictionary).

The training data is filtered to include sentences up to max_length on source and target side.

If you set random_{train|dev}_subset, a random selection of this size is used from the {train|development} set instead of the full {train|development} set.

Parameters:
  • cfg – configuration dictionary for data (“data” part of config file)

  • datasets – list of dataset names to load

Returns:

  • src_vocab: source vocabulary

  • trg_vocab: target vocabulary

  • train_data: training dataset

  • dev_data: development dataset

  • test_data: test dataset if given, otherwise None
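
A hedged sketch that wires load_config and load_data together; the config path is illustrative and the returned tuple follows the order documented above.

from joeynmt.config import load_config
from joeynmt.data import load_data

cfg = load_config("configs/iwslt14_ende_spm.yaml")
src_vocab, trg_vocab, train_data, dev_data, test_data = load_data(
    cfg["data"], datasets=["train", "dev", "test"])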

joeynmt.datasets module#

Dataset module

class joeynmt.datasets.BaseDataset(path: str, src_lang: str, trg_lang: str, split: str = 'train', has_trg: bool = False, has_prompt: Dict[str, bool] = None, tokenizer: Dict[str, BasicTokenizer] = None, sequence_encoder: Dict[str, Callable] = None, random_subset: int = -1)[source]#

Bases: Dataset

BaseDataset which loads and looks up data. Holds pointers to the tokenizers and encoding functions.

Parameters:
  • path – path to data directory

  • src_lang – source language code, e.g. en

  • trg_lang – target language code, e.g. de

  • has_trg – bool indicator if trg exists

  • has_prompt – bool indicator if prompt exists

  • split – name of the data split, e.g. “train”, “dev” or “test”

  • tokenizer – tokenizer objects

  • sequence_encoder – encoding functions

collate_fn(batch: List[Tuple], pad_index: int, eos_index: int, device: device = device(type='cpu')) Batch[source]#

Custom collate function. See https://pytorch.org/docs/stable/data.html#dataloader-collate-fn for details. Please override the batch class here. (not in TrainManager)

Parameters:
  • batch

  • pad_index

  • eos_index

  • device

Returns:

joeynmt batch object

get_item(idx: int, lang: str, is_train: bool = None) List[str][source]#

Fetch one src/trg item at the given index.
  • Tokenization is applied here.

  • Length-filtering, BPE-dropout etc. are also triggered if self.split == “train”.

get_list(lang: str, tokenized: bool = False, subsampled: bool = True) List[str] | List[List[str]][source]#

get data column-wise.

load_data(path: Path, **kwargs) Any[source]#
load data
  • preprocessing (lowercasing etc) is applied here.

lookup_item(idx: int, lang: str) Tuple[str, str][source]#
make_iter(batch_size: int, batch_type: str = 'sentence', seed: int = 42, shuffle: bool = False, num_workers: int = 0, pad_index: int = 1, eos_index: int = 3, device: device = device(type='cpu'), generator_state: Tensor = None) DataLoader[source]#

Returns a torch DataLoader for a torch Dataset. (no bucketing)

Parameters:
  • batch_size – size of the batches the iterator prepares

  • batch_type – measure batch size by sentence count or by token count

  • seed – random seed for shuffling

  • shuffle – whether to shuffle the order of sequences before each epoch (has no effect for test data even if set to True; the generator is still used for random subsampling, but not for permutation)

  • num_workers – number of cpus for multiprocessing

  • pad_index

  • eos_index

  • device

  • generator_state

Returns:

torch DataLoader
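
A hedged iteration sketch; the pad/eos indices and batch size are illustrative, and the loader yields joeynmt.batch.Batch objects via the dataset's collate_fn().

import torch

data_iter = train_data.make_iter(
    batch_size=32, batch_type="sentence", shuffle=True,
    pad_index=1, eos_index=3, device=torch.device("cpu"))
for batch in data_iter:
    pass  # each batch is a joeynmt.batch.Batch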

reset_indices(random_subset: int = None)[source]#
property src: List[str]#

get detokenized preprocessed data in src language.

property trg: List[str]#

get detokenized preprocessed data in trg language.

class joeynmt.datasets.BaseHuggingfaceDataset(path: str, src_lang: str, trg_lang: str, has_trg: bool = True, has_prompt: Dict[str, bool] = None, tokenizer: Dict[str, BasicTokenizer] = None, sequence_encoder: Dict[str, Callable] = None, random_subset: int = -1, **kwargs)[source]#

Bases: BaseDataset

Wrapper for Huggingface’s dataset object cf.) https://huggingface.co/docs/datasets

COLUMN_NAME = 'sentence'#
get_list(lang: str, tokenized: bool = False, subsampled: bool = True) List[str] | List[List[str]][source]#

get data column-wise.

load_data(path: str, **kwargs) Any[source]#
load data
  • preprocessing (lowercasing etc) is applied here.

lookup_item(idx: int, lang: str) Tuple[str, str][source]#
class joeynmt.datasets.HuggingfaceTranslationDataset(path: str, src_lang: str, trg_lang: str, has_trg: bool = True, has_prompt: Dict[str, bool] = None, tokenizer: Dict[str, BasicTokenizer] = None, sequence_encoder: Dict[str, Callable] = None, random_subset: int = -1, **kwargs)[source]#

Bases: BaseHuggingfaceDataset

Wrapper for Huggingface’s datasets.features.Translation class cf.) https://github.com/huggingface/datasets/blob/master/src/datasets/features/translation.py

COLUMN_NAME = 'translation'#
load_data(path: str, **kwargs) Any[source]#
load data
  • preprocessing (lowercasing etc) is applied here.

class joeynmt.datasets.PlaintextDataset(path: str, src_lang: str, trg_lang: str, split: str = 'train', has_trg: bool = False, has_prompt: Dict[str, bool] = None, tokenizer: Dict[str, BasicTokenizer] = None, sequence_encoder: Dict[str, Callable] = None, random_subset: int = -1, **kwargs)[source]#

Bases: BaseDataset

PlaintextDataset which stores plain text pairs. Used for text files with one sentence per line.

get_list(lang: str, tokenized: bool = False, subsampled: bool = True) List[str] | List[List[str]][source]#

Return list of preprocessed sentences in the given language. (not length-filtered, no bpe-dropout)

load_data(path: str, **kwargs) Any[source]#
load data
  • preprocessing (lowercasing etc) is applied here.

lookup_item(idx: int, lang: str) Tuple[str, str][source]#
class joeynmt.datasets.SentenceBatchSampler(sampler: Sampler, batch_size: int, drop_last: bool, seed: int)[source]#

Bases: BatchSampler

Wraps another sampler to yield a mini-batch of indices based on the number of instances. An instance longer than dataset.max_len will be filtered out.

Parameters:
  • sampler – Base sampler. Can be any iterable object

  • batch_size – Size of mini-batch.

  • drop_last – If True, the sampler will drop the last batch if its size would be less than batch_size

get_state()[source]#
property num_samples: int#

Returns number of samples in the dataset. This may change during sampling.

Note: len(dataset) won’t change during sampling.

Use len(dataset) instead, to retrieve the original dataset length.

reset() None[source]#
set_seed(seed: int) None[source]#
set_state(state) None[source]#
class joeynmt.datasets.StreamDataset(path: str, src_lang: str, trg_lang: str, split: str = 'test', has_trg: bool = False, has_prompt: Dict[str, bool] = None, tokenizer: Dict[str, BasicTokenizer] = None, sequence_encoder: Dict[str, Callable] = None, random_subset: int = -1, **kwargs)[source]#

Bases: BaseDataset

StreamDataset which interacts with stream inputs. Called by the translate() function in prediction.py.

lookup_item(idx: int, lang: str) Tuple[str, str][source]#
reset_cache()[source]#
set_item(src_line: str, trg_line: str | None = None, src_prompt: str | None = None, trg_prompt: str | None = None) None[source]#

Set input text to the cache.

Parameters:
  • src_line – (non-empty) str

  • trg_line – Optional[str]

  • src_prompt – Optional[str]

  • trg_prompt – Optional[str]

class joeynmt.datasets.TokenBatchSampler(sampler: Sampler, batch_size: int, drop_last: bool, seed: int)[source]#

Bases: SentenceBatchSampler

Wraps another sampler to yield a mini-batch of indices based on the number of tokens (incl. padding). An instance longer than dataset.max_len or shorter than dataset.min_len will be filtered out. No bucketing is implemented.

Warning

In DDP, we shouldn’t use TokenBatchSampler for prediction, because we cannot ensure that the data points will be distributed evenly across devices. ddp_merge() (dist.all_gather()) called in predict() can get stuck.

Parameters:
  • sampler – Base sampler. Can be any iterable object

  • batch_size – Size of mini-batch.

  • drop_last – If True, the sampler will drop the last batch if its size would be less than batch_size

class joeynmt.datasets.TsvDataset(path: str, src_lang: str, trg_lang: str, split: str = 'train', has_trg: bool = False, has_prompt: Dict[str, bool] = None, tokenizer: Dict[str, BasicTokenizer] = None, sequence_encoder: Dict[str, Callable] = None, random_subset: int = -1, **kwargs)[source]#

Bases: BaseDataset

TsvDataset which handles data in tsv format.
  • file_name should be specified without the .tsv extension.

  • Needs src_lang and trg_lang (e.g. en, de) in the header. See: test/data/toy/dev.tsv

get_list(lang: str, tokenized: bool = False, subsampled: bool = True) List[str] | List[List[str]][source]#

get data column-wise.

load_data(path: str, **kwargs) Any[source]#
load data
  • preprocessing (lowercasing etc) is applied here.

lookup_item(idx: int, lang: str) Tuple[str, str][source]#
joeynmt.datasets.build_dataset(dataset_type: str, path: str, src_lang: str, trg_lang: str, split: str, tokenizer: Dict = None, sequence_encoder: Dict = None, has_prompt: Dict = None, random_subset: int = -1, **kwargs)[source]#

Builds a dataset.

Parameters:
  • dataset_type – (str) one of {plain, tsv, stream, huggingface}

  • path – (str) either a local file name or dataset name to download from remote

  • src_lang – (str) language code for source

  • trg_lang – (str) language code for target

  • split – (str) one of {train, dev, test}

  • tokenizer – tokenizer objects for both source and target

  • sequence_encoder – encoding functions for both source and target

  • has_prompt – prompt indicators

  • random_subset – (int) size of the random subset to draw; -1 means no subsampling

Returns:

loaded Dataset
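
A hedged sketch only: the file prefix, language codes and assumed file layout (train.de / train.en) are illustrative, and in practice load_data() also passes tokenizer and sequence_encoder dictionaries built from the config, which are omitted here.

train_data = build_dataset(
    dataset_type="plain",
    path="test/data/toy/train",   # assumed layout: train.de / train.en
    src_lang="de", trg_lang="en",
    split="train")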

joeynmt.decoders module#

Various decoders

class joeynmt.decoders.Decoder(*args, **kwargs)[source]#

Bases: Module

Base decoder class

property output_size#

Return the output size (size of the target vocabulary)

Returns:

class joeynmt.decoders.RecurrentDecoder(rnn_type: str = 'gru', emb_size: int = 0, hidden_size: int = 0, encoder: Encoder = None, attention: str = 'bahdanau', num_layers: int = 1, vocab_size: int = 0, dropout: float = 0.0, emb_dropout: float = 0.0, hidden_dropout: float = 0.0, init_hidden: str = 'bridge', input_feeding: bool = True, freeze: bool = False, **kwargs)[source]#

Bases: Decoder

A conditional RNN decoder with attention.

forward(trg_embed: Tensor, encoder_output: Tensor, encoder_hidden: Tensor, src_mask: Tensor, unroll_steps: int, hidden: Tensor = None, prev_att_vector: Tensor = None, **kwargs) Tuple[Tensor, Tensor, Tensor, Tensor, Tensor][source]#

Unroll the decoder one step at a time for unroll_steps steps. For every step, the _forward_step function is called internally.

During training, the target inputs (trg_embed) are already known for the full sequence, so the full unroll is done. In this case, hidden and prev_att_vector are None.

For inference, this function is called with one step at a time since embedded targets are the predictions from the previous time step. In this case, hidden and prev_att_vector are fed from the output of the previous call of this function (from the 2nd step on).

src_mask is needed to mask out the areas of the encoder states that should not receive any attention, which is everything after the first <eos>.

The encoder_output are the hidden states from the encoder and are used as context for the attention.

The encoder_hidden is the last encoder hidden state that is used to initialize the first hidden decoder state (when self.init_hidden_option is “bridge” or “last”).

Parameters:
  • trg_embed – embedded target inputs, shape (batch_size, trg_length, embed_size)

  • encoder_output – hidden states from the encoder, shape (batch_size, src_length, encoder.output_size)

  • encoder_hidden – last state from the encoder, shape (batch_size, encoder.output_size)

  • src_mask – mask for src states: 0s for padded areas, 1s for the rest, shape (batch_size, 1, src_length)

  • unroll_steps – number of steps to unroll the decoder RNN

  • hidden – previous decoder hidden state, if not given it’s initialized as in self.init_hidden, shape (batch_size, num_layers, hidden_size)

  • prev_att_vector – previous attentional vector, if not given it’s initialized with zeros, shape (batch_size, 1, hidden_size)

Returns:

  • outputs: shape (batch_size, unroll_steps, vocab_size),

  • hidden: last hidden state (batch_size, num_layers, hidden_size),

  • att_probs: attention probabilities

    with shape (batch_size, unroll_steps, src_length),

  • att_vectors: attentional vectors

    with shape (batch_size, unroll_steps, hidden_size)

class joeynmt.decoders.TransformerDecoder(num_layers: int = 4, num_heads: int = 8, hidden_size: int = 512, ff_size: int = 2048, dropout: float = 0.1, emb_dropout: float = 0.1, vocab_size: int = 1, freeze: bool = False, **kwargs)[source]#

Bases: Decoder

A transformer decoder with N masked layers. Decoder layers are masked so that an attention head cannot see the future.

forward(trg_embed: Tensor, encoder_output: Tensor, encoder_hidden: Tensor, src_mask: Tensor, unroll_steps: int, hidden: Tensor, trg_mask: Tensor, **kwargs)[source]#

Transformer decoder forward pass.

Parameters:
  • trg_embed – embedded targets

  • encoder_output – source representations

  • encoder_hidden – unused

  • src_mask

  • unroll_steps – unused

  • hidden – unused

  • trg_mask – to mask out target paddings Note that a subsequent mask is applied here.

  • kwargs

Returns:

  • decoder_output: shape (batch_size, seq_len, vocab_size)

  • decoder_hidden: shape (batch_size, seq_len, emb_size)

  • att_probs: shape (batch_size, trg_length, src_length),

  • None

joeynmt.embeddings module#

Embedding module

class joeynmt.embeddings.Embeddings(embedding_dim: int = 64, scale: bool = False, vocab_size: int = 0, padding_idx: int = 1, freeze: bool = False, **kwargs)[source]#

Bases: Module

Simple embeddings class

forward(x: Tensor) Tensor[source]#

Perform lookup for input x in the embedding table.

Parameters:

x – index in the vocabulary

Returns:

embedded representation for x

load_from_file(embed_path: Path, vocab: Vocabulary) None[source]#

Load pretrained embedding weights from text file.

  • First line is expected to contain vocabulary size and dimension. The dimension has to match the model’s specified embedding size, the vocabulary size is used in logging only.

  • Each line should contain word and embedding weights separated by spaces.

  • The pretrained vocabulary items that are not part of the joeynmt’s vocabulary will be ignored (not loaded from the file).

  • The initialization (specified in config[“model”][“embed_initializer”]) of joeynmt’s vocabulary items that are not part of the pretrained vocabulary will be kept (not overwritten in this func).

  • This function should be called after initialization!

Example:

2 5
the -0.0230 -0.0264 0.0287 0.0171 0.1403
at -0.0395 -0.1286 0.0275 0.0254 -0.0932

Parameters:
  • embed_path – embedding weights text file

  • vocab – Vocabulary object

joeynmt.encoders module#

Various encoders

class joeynmt.encoders.Encoder(*args, **kwargs)[source]#

Bases: Module

Base encoder class

property output_size#

Return the output size

Returns:

class joeynmt.encoders.RecurrentEncoder(rnn_type: str = 'gru', hidden_size: int = 1, emb_size: int = 1, num_layers: int = 1, dropout: float = 0.0, emb_dropout: float = 0.0, bidirectional: bool = True, freeze: bool = False, **kwargs)[source]#

Bases: Encoder

Encodes a sequence of word embeddings

forward(src_embed: Tensor, src_length: Tensor, mask: Tensor, **kwargs) Tuple[Tensor, Tensor, Tensor][source]#

Applies a bidirectional RNN to sequence of embeddings x. The input mini-batch x needs to be sorted by src length. x and mask should have the same dimensions [batch, time, dim].

Parameters:
  • src_embed – embedded src inputs, shape (batch_size, src_len, embed_size)

  • src_length – length of src inputs (counting tokens before padding), shape (batch_size)

  • mask – indicates padding areas (zeros where padding), shape (batch_size, src_len, embed_size)

  • kwargs

Returns:

  • output: hidden states with

    shape (batch_size, max_length, directions*hidden),

  • hidden_concat: last hidden state with

    shape (batch_size, directions*hidden)

class joeynmt.encoders.TransformerEncoder(hidden_size: int = 512, ff_size: int = 2048, num_layers: int = 8, num_heads: int = 4, dropout: float = 0.1, emb_dropout: float = 0.1, freeze: bool = False, **kwargs)[source]#

Bases: Encoder

Transformer Encoder

forward(src_embed: Tensor, src_length: Tensor, mask: Tensor = None, **kwargs) Tuple[Tensor, Tensor][source]#

Pass the input (and mask) through each layer in turn. Applies a Transformer encoder to sequence of embeddings x. The input mini-batch x needs to be sorted by src length. x and mask should have the same dimensions [batch, time, dim].

Parameters:
  • src_embed – embedded src inputs, shape (batch_size, src_len, embed_size)

  • src_length – length of src inputs (counting tokens before padding), shape (batch_size)

  • mask – indicates padding areas (zeros where padding), shape (batch_size, 1, src_len)

  • kwargs

Returns:

  • output: hidden states with shape (batch_size, max_length, hidden)

  • None

joeynmt.helpers module#

Collection of helper functions

joeynmt.helpers.adjust_mask_size(mask: Tensor, batch_size: int, hyp_len: int) Tensor[source]#

Adjust mask size along dim=1. Used for forced decoding (trg prompting).

Parameters:
  • mask – trg prompt mask in shape (batch_size, hyp_len)

  • batch_size

  • hyp_len

joeynmt.helpers.check_version(cfg_version: str = None) str[source]#

Check joeynmt version

Parameters:

cfg_version – version number specified in config

Returns:

package version number string

joeynmt.helpers.clones(module: Module, n: int) ModuleList[source]#

Produce N identical layers. Transformer helper function.

Parameters:
  • module – the module to clone

  • n – clone this many times

Returns:

cloned modules

joeynmt.helpers.delete_ckpt(to_delete: Path) None[source]#

Delete checkpoint

Parameters:

to_delete – checkpoint file to be deleted

joeynmt.helpers.expand_reverse_index(reverse_index: List[int], n_best: int = 1) List[int][source]#

Expand resort_reverse_index for n_best prediction

ex. 1) reverse_index = [1, 0, 2] and n_best = 2, then this will return [2, 3, 0, 1, 4, 5].

ex. 2) reverse_index = [1, 0, 2] and n_best = 3, then this will return [3, 4, 5, 0, 1, 2, 6, 7, 8]

Parameters:
  • reverse_index – reverse_index returned from batch.sort_by_src_length()

  • n_best

Returns:

expanded sort_reverse_index
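
The first example above, as a doctest:

>>> from joeynmt.helpers import expand_reverse_index
>>> expand_reverse_index([1, 0, 2], n_best=2)
[2, 3, 0, 1, 4, 5]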

joeynmt.helpers.flatten(array: List[List[Any]]) List[Any][source]#

Flatten a nested 2D list. This is faster, even for very long arrays, than [item for subarray in array for item in subarray] or newarray.extend().

Parameters:

array – a nested list

Returns:

flattened list
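
For example:

>>> from joeynmt.helpers import flatten
>>> flatten([[1, 2], [3], [4, 5]])
[1, 2, 3, 4, 5]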

joeynmt.helpers.freeze_params(module: Module) None[source]#

Freeze the parameters of this module, i.e. do not update them during training

Parameters:

module – freeze parameters of this module

joeynmt.helpers.get_latest_checkpoint(ckpt_dir: Path) Path | None[source]#

Returns the latest checkpoint (by creation time, not by step number!) from the given directory. If there is no checkpoint in this directory, returns None.

Parameters:

ckpt_dir

Returns:

latest checkpoint file

joeynmt.helpers.load_checkpoint(path: Path, map_location: device | Dict) Dict[source]#

Load model from saved checkpoint.

Parameters:
  • path – path to checkpoint

  • map_location – device to map the checkpoint onto (a cuda device or “cpu”)

Returns:

checkpoint (dict)

joeynmt.helpers.make_model_dir(model_dir: Path, overwrite: bool = False) None[source]#

Create a new directory for the model.

Parameters:
  • model_dir – path to model directory

  • overwrite – whether to overwrite an existing directory

joeynmt.helpers.read_list_from_file(input_path: Path) List[str][source]#

Read list of str from file in input_path.

Parameters:

input_path – input file path

Returns:

list of strings

joeynmt.helpers.remove_extra_spaces(s: str) str[source]#

Remove extra spaces - used in pre_process() / post_process() in tokenizer.py

Parameters:

s – input string

Returns:

string w/o extra white spaces

joeynmt.helpers.resolve_ckpt_path(load_model: Path, model_dir: Path) Path[source]#

Get the checkpoint path. If load_model is not specified, take the best or latest checkpoint from the model dir.

Parameters:
  • load_model – Path(cfg[‘training’][‘load_model’]) or Path(cfg[‘testing’][‘load_model’])

  • model_dir – Path(cfg[‘model_dir’])

Returns:

resolved checkpoint path

joeynmt.helpers.save_hypothese(output_path: Path, hypotheses: List[str], n_best: str = 1) None[source]#

Save a list of hypotheses to a file.

Parameters:
  • output_path – output file path

  • hypotheses – hypotheses to write

  • n_best – n_best size

joeynmt.helpers.set_seed(seed: int) None[source]#

Set the random seed for modules torch, numpy and random.

Parameters:

seed – random seed

joeynmt.helpers.store_attention_plots(attentions: ndarray, targets: List[List[str]], sources: List[List[str]], output_prefix: str, indices: List[int], tb_writer: SummaryWriter | None = None, steps: int = 0) None[source]#

Saves attention plots.

Parameters:
  • attentions – attention scores

  • targets – list of tokenized targets

  • sources – list of tokenized sources

  • output_prefix – prefix for attention plots

  • indices – indices selected for plotting

  • tb_writer – Tensorboard summary writer (optional)

  • steps – current training steps, needed for tb_writer

  • dpi – resolution for images

joeynmt.helpers.subsequent_mask(size: int) Tensor[source]#

Mask out subsequent positions (to prevent attending to future positions). Transformer helper function.

Parameters:

size – size of mask (2nd and 3rd dim)

Returns:

Tensor with 0s and 1s of shape (1, size, size)
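
A short usage sketch; only the shape and the lower-triangular pattern are relied on here.

>>> from joeynmt.helpers import subsequent_mask
>>> m = subsequent_mask(4)
>>> m.shape   # position i may attend to positions 0..i only
torch.Size([1, 4, 4])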

joeynmt.helpers.symlink_update(target: Path, link_name: Path) Path | None[source]#

This function finds the file that the symlink currently points to, sets it to the new target, and returns the previous target if it exists.

Parameters:
  • target – A path to a file that we want the symlink to point to. no parent dir, filename only, i.e. “10000.ckpt”

  • link_name – This is the name of the symlink that we want to update. link name with parent dir, i.e. “models/my_model/best.ckpt”

Returns:

  • current_last: This is the previous target of the symlink, before it is

    updated in this function. If the symlink did not exist before or did not have a target, None is returned instead.

joeynmt.helpers.tile(x: Tensor, count: int, dim=0) Tensor[source]#

Tiles x on dimension dim count times. From OpenNMT. Used for beam search.

Parameters:
  • x – tensor to tile

  • count – number of tiles

  • dim – dimension along which the tensor is tiled

Returns:

tiled tensor
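
For example (only the resulting shape is shown; each batch entry is repeated count times, as needed for beam search):

>>> import torch
>>> from joeynmt.helpers import tile
>>> x = torch.arange(6).reshape(2, 3)
>>> tile(x, count=4, dim=0).shape
torch.Size([8, 3])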

joeynmt.helpers.unicode_normalize(s: str) str[source]#

Apply unicodedata NFKC normalization. Used in pre_process() in tokenizer.py.

Parameters:

s – input string

Returns:

normalized string

joeynmt.helpers.write_list_to_file(output_path: Path, array: List[Any]) None[source]#

Write list of str to file in output_path.

Parameters:
  • output_path – output file path

  • array – list of strings

joeynmt.initialization module#

Implements custom initialization

joeynmt.initialization.compute_alpha_beta(num_enc_layers: int, num_dec_layers: int) Dict[str, Dict][source]#

DeepNet: compute alpha/beta value suggested in https://arxiv.org/abs/2203.00555

joeynmt.initialization.initialize_model(model: Module, cfg: dict, src_padding_idx: int, trg_padding_idx: int) None[source]#

This initializes a model based on the provided config.

All initializer configuration is part of the model section of the configuration file. For an example, see e.g. https://github.com/joeynmt/joeynmt/blob/main/configs/iwslt14_ende_spm.yaml.

The main initializer is set using the initializer key. Possible values are xavier, uniform, normal or zeros. (xavier is the default).

When an initializer is set to uniform, then init_weight sets the range for the values (-init_weight, init_weight).

When an initializer is set to normal, then init_weight sets the standard deviation for the weights (with mean 0).

The word embedding initializer is set using embed_initializer and takes the same values. The default is normal with embed_init_weight = 0.01.

Biases are initialized separately using bias_initializer. The default is zeros, but you can use the same initializers as the main initializer.

Set init_rnn_orthogonal to True if you want RNN orthogonal initialization (for recurrent matrices). Default is False.

lstm_forget_gate controls how the LSTM forget gate is initialized. Default is 1.

Parameters:
  • model – model to initialize

  • cfg – the model configuration

  • src_padding_idx – index of source padding token

  • trg_padding_idx – index of target padding token
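
A hedged “model” config fragment using the keys described above; the values are illustrative, and model is assumed to be an already constructed joeynmt Model with padding indices taken from the vocabularies.

model_cfg = {
    "initializer": "xavier",
    "embed_initializer": "normal",
    "embed_init_weight": 0.01,
    "bias_initializer": "zeros",
    "init_rnn_orthogonal": False,
    "lstm_forget_gate": 1.0,
}
initialize_model(model, model_cfg, src_padding_idx=1, trg_padding_idx=1)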

joeynmt.initialization.lstm_forget_gate_init_(cell: RNNBase, value: float = 1.0) None[source]#

Initialize LSTM forget gates with value.

Parameters:
  • cell – LSTM cell

  • value – initial value, default: 1

joeynmt.initialization.orthogonal_rnn_init_(cell: RNNBase, gain: float = 1.0) None[source]#

Orthogonal initialization of recurrent weights. RNN parameters contain 3 or 4 matrices in one parameter, so we slice them.

joeynmt.initialization.xavier_uniform_n_(w: Tensor, gain: float = 1.0, n: int = 4) None[source]#

Xavier initializer for parameters that combine multiple matrices in one parameter for efficiency. This is e.g. used for GRU and LSTM parameters, where e.g. all gates are computed at the same time by 1 big matrix.

Parameters:
  • w – parameter

  • gain – default 1

  • n – default 4

joeynmt.metrics module#

Evaluation metrics

joeynmt.metrics.bleu(hypotheses: List[str], references: List[str], **sacrebleu_cfg) float[source]#

Raw corpus BLEU from sacrebleu (without tokenization) cf. https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/metrics/bleu.py

Parameters:
  • hypotheses – list of hypotheses (strings)

  • references – list of references (strings)

Returns:

bleu score

joeynmt.metrics.chrf(hypotheses: List[str], references: List[str], **sacrebleu_cfg) float[source]#

Character F-score from sacrebleu cf. https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/metrics/chrf.py

Parameters:
  • hypotheses – list of hypotheses (strings)

  • references – list of references (strings)

Returns:

character F-score (0 <= chrf <= 1); see the breaking change in sacrebleu v2.0

joeynmt.metrics.sequence_accuracy(hypotheses: List[str], references: List[str]) float[source]#

Compute the accuracy of hypothesis tokens: correct tokens / all tokens Tokens are correct if they appear in the same position in the reference. We lookup the references before one-hot-encoding, that is, hypotheses with UNK are always evaluated as incorrect.

Parameters:
  • hypotheses – list of hypotheses (strings)

  • references – list of references (strings)

Returns:

joeynmt.metrics.token_accuracy(hypotheses: List[str], references: List[str], tokenizer: Callable) float[source]#

Compute the accuracy of hypothesis tokens: correct tokens / all tokens Tokens are correct if they appear in the same position in the reference. We lookup the references before one-hot-encoding, that is, UNK generation in hypotheses is always evaluated as incorrect.

Parameters:
  • hypotheses – list of hypotheses (strings)

  • references – list of references (strings)

Returns:

token accuracy (float)

joeynmt.model module#

Module to represents whole models

class joeynmt.model.DataParallelWrapper(module: Module)[source]#

Bases: Module

DataParallel wrapper to pass through the model attributes

ex. 1) for DataParallel
>>> from torch.nn import DataParallel as DP
>>> model = DataParallelWrapper(DP(model))
ex. 2) for DistributedDataParallel
>>> from torch.nn.parallel import DistributedDataParallel as DDP
>>> model = DataParallelWrapper(DDP(model))
forward(*args, **kwargs)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

load_state_dict(*args, **kwargs)[source]#

Loads the state dict into the twice-wrapped module.

state_dict(*args, **kwargs)[source]#

Returns the state dict of the twice-wrapped module.

class joeynmt.model.Model(encoder: Encoder, decoder: Decoder, src_embed: Embeddings, trg_embed: Embeddings, src_vocab: Vocabulary, trg_vocab: Vocabulary)[source]#

Bases: Module

Base Model class

forward(return_type: str = None, **kwargs) Tuple[Tensor, Tensor, Tensor, Tensor][source]#

Interface for multi-gpu

For DataParallel, we need to route all model calls (model.encode(), model.decode(), and model.encode_decode()) through model.__call__(). model.__call__() triggers model.forward() together with the registered pre and post hooks, which take care of the multi-GPU distribution.

Parameters:

return_type – one of {“loss”, “encode”, “decode”}

log_parameters_list() None[source]#

Write all model parameters (name, shape) to the log.

property loss_function#
joeynmt.model.build_model(cfg: Dict = None, src_vocab: Vocabulary = None, trg_vocab: Vocabulary = None) Model[source]#

Build and initialize the model according to the configuration.

Parameters:
  • cfg – dictionary configuration containing model specifications

  • src_vocab – source vocabulary

  • trg_vocab – target vocabulary

Returns:

built and initialized model
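
A hedged sketch that connects the pieces above, assuming cfg, src_vocab and trg_vocab were produced by load_config and load_data as shown earlier.

from joeynmt.model import build_model

model = build_model(cfg["model"], src_vocab=src_vocab, trg_vocab=trg_vocab)
model.log_parameters_list()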

joeynmt.plotting module#

Plot attentions

joeynmt.plotting.plot_heatmap(scores: ndarray, column_labels: List[str], row_labels: List[str], output_path: str | None = None, dpi: int = 300) Figure[source]#

Plotting function that can be used to visualize (self-)attention. Plots are saved if output_path is specified, in format that this file ends with (‘pdf’ or ‘png’).

Parameters:
  • scores – attention scores

  • column_labels – labels for columns (e.g. target tokens)

  • row_labels – labels for rows (e.g. source tokens)

  • output_path – path to save to

  • dpi – set resolution for matplotlib

Returns:

pyplot figure

joeynmt.prediction module#

This module holds methods for generating predictions from a model.

joeynmt.prediction.evaluate(valid_scores: Dict, valid_hyp: List, data: Dataset, args: TestConfig) Tuple[Dict[str, float], List[str]][source]#

Compute evaluation metrics

Parameters:
  • valid_scores – scores dict

  • valid_hyp – decoded hypotheses

  • data – eval Dataset

  • args – configuration args

Returns:

  • valid_scores: evaluation scores

  • valid_ref: postprocessed references

joeynmt.prediction.predict(model: Model, data: Dataset, device: device, n_gpu: int, rank: int = 0, compute_loss: bool = False, normalization: str = 'batch', num_workers: int = 0, args: TestConfig = None, autocast: Dict = None) Tuple[Dict[str, float], List[str] | None, List[str] | None, List[List[str]], List[ndarray], List[ndarray]][source]#

Generates translations for the given data. If compute_loss is True and references are given, also computes the loss.

Parameters:
  • model – model module

  • data – dataset for validation

  • device – torch device

  • n_gpu – number of GPUs

  • rank – ddp rank

  • compute_loss – whether to compute a scalar loss for given inputs and targets

  • normalization – one of {batch, tokens, none}

  • num_workers – number of workers for collate_fn() in data iterator

  • args – configuration args

  • autocast – autocast context

Returns:

  • valid_scores: (dict) current validation scores,

  • valid_ref: (list of str) post-processed validation references,

  • valid_hyp: (list of str) post-processed validation hypotheses,

  • decoded_valid: (list of list of str) token-level validation hypotheses,

  • valid_seq_scores: (list of np.array) log probabilities (hyp or ref)

  • valid_attn_scores: (list of np.array) attention scores (hyp or ref)

joeynmt.prediction.prepare(args: BaseConfig, rank: int, mode: str) Tuple[Model, Dataset, Dataset, Dataset][source]#

Helper function for model and data loading.

Parameters:
  • args – config args

  • rank – ddp rank

  • mode – execution mode

joeynmt.prediction.test(cfg: Dict, output_path: str = None, prepared: Dict = None, save_attention: bool = False, save_scores: bool = False) None[source]#

Main test function. Handles loading a model from checkpoint, generating translations, storing them, and plotting attention.

Parameters:
  • cfg – configuration dict

  • output_path – path to output

  • prepared – model and datasets passed from training

  • save_attention – whether to save attention visualizations

  • save_scores – whether to save scores

joeynmt.prediction.translate(cfg: Dict, output_path: str = None) None[source]#

Interactive translation function. Loads model from checkpoint and translates either the stdin input or asks for input to translate interactively. Translations and scores are printed to stdout. Note: The input sentences don’t have to be pre-tokenized.

Parameters:
  • cfg – configuration dict

  • output_path – path to output file

joeynmt.search module#

Search module

joeynmt.search.beam_search(model, beam_size, encoder_output, encoder_hidden, src_mask, max_output_length, alpha, n_best=1, **kwargs)[source]#

Beam search with size k. In each decoding step, find the k most likely partial hypotheses. Inspired by OpenNMT-py, adapted for Transformer.

Parameters:
  • model

  • beam_size – size of the beam

  • encoder_output

  • encoder_hidden

  • src_mask

  • max_output_length

  • alpha – alpha factor for length penalty

  • n_best – return this many hypotheses, <= beam (currently only 1)

Returns:

  • stacked_output: output hypotheses (2d array of indices),

  • stacked_scores: scores (2d array of sequence-wise log probabilities),

  • stacked_attention_scores: attention scores (3d array)

joeynmt.search.greedy(src_mask: Tensor, max_output_length: int, model: Model, encoder_output: Tensor, encoder_hidden: Tensor, **kwargs) Tuple[Tensor, Tensor, Tensor][source]#

Greedy decoding. Select the token with the highest probability at each time step. This function is a wrapper that calls recurrent_greedy for recurrent decoders and transformer_greedy for transformer decoders.

Parameters:
  • src_mask – mask for source inputs, 0 for positions after </s>

  • max_output_length – maximum length for the hypotheses

  • model – model to use for greedy decoding

  • encoder_output – encoder hidden states for attention

  • encoder_hidden – encoder last state for decoder initialization

Returns:

  • stacked_output: output hypotheses (2d array of indices),

  • stacked_scores: scores (2d array of token-wise log probabilities),

  • stacked_attention_scores: attention scores (3d array)

joeynmt.search.search(model: Model, batch: Batch, max_output_length: int, beam_size: int, beam_alpha: float, n_best: int = 1, **kwargs) Tuple[ndarray, ndarray, ndarray][source]#

Get outputs and attention scores for a given batch.

Parameters:
  • model – Model class

  • batch – batch to generate hypotheses for

  • max_output_length – maximum length of hypotheses

  • beam_size – size of the beam for beam search, if 0 use greedy

  • beam_alpha – alpha value for beam search

  • n_best – candidates to return

Returns:

  • stacked_output: hypotheses for batch,

  • stacked_scores: log probabilities for batch,

  • stacked_attention_scores: attention scores for batch

joeynmt.tokenizers module#

Tokenizer module

class joeynmt.tokenizers.BasicTokenizer(level: str = 'word', lowercase: bool = False, normalize: bool = False, max_length: int = -1, min_length: int = -1, **kwargs)[source]#

Bases: object

SPACE = ' '#
SPACE_ESCAPE = '▁'#
post_process(sequence: List[str] | str, generate_unk: bool = True, cut_at_sep: bool = True) str[source]#

Detokenize

pre_process(raw_input: str, allow_empty: bool = False) str[source]#
Pre-process text
  • e.g. lowercase, normalize, remove emojis, pre-tokenize (add extra white space before punctuation), etc.

  • Applied to all inputs, both in training and inference.

Parameters:
  • raw_input – raw input string

  • allow_empty – whether to allow empty string

Returns:

preprocessed input string

set_vocab(vocab) None[source]#

Set vocab

Parameters:

vocab – (Vocabulary)

class joeynmt.tokenizers.SentencePieceTokenizer(level: str = 'bpe', lowercase: bool = False, normalize: bool = False, max_length: int = -1, min_length: int = -1, **kwargs)[source]#

Bases: BasicTokenizer

copy_cfg_file(model_dir: Path) None[source]#

Copy config file to model_dir

post_process(sequence: List[str] | str, generate_unk: bool = True, cut_at_sep: bool = True) str[source]#

Detokenize

set_vocab(vocab) None[source]#

Set vocab

class joeynmt.tokenizers.SubwordNMTTokenizer(level: str = 'bpe', lowercase: bool = False, normalize: bool = False, max_length: int = -1, min_length: int = -1, **kwargs)[source]#

Bases: BasicTokenizer

copy_cfg_file(model_dir: Path) None[source]#

Copy config file to model_dir

post_process(sequence: List[str] | str, generate_unk: bool = True, cut_at_sep: bool = True) str[source]#

Detokenize

set_vocab(vocab) None[source]#

Set vocab

joeynmt.tokenizers.build_tokenizer(cfg: Dict) Dict[str, BasicTokenizer][source]#

joeynmt.training module#

Training module

class joeynmt.training.TrainManager(rank: int, model: Model, model_dir: Path, device: device, n_gpu: int = 0, num_workers: int = 0, autocast: Dict = None, seed: int = 42, train_args: TrainConfig = None, dev_args: TestConfig = None)[source]#

Bases: object

Manages training loop, validations, learning rate scheduling and early stopping.

class TrainStatistics(minimize_metric: bool = True)[source]#

Bases: object

Train Statistics

Parameters:
  • epochs – epoch counter

  • steps – global update step counter

  • is_min_lr – stop by reaching learning rate minimum

  • is_max_update – stop by reaching max num of updates

  • total_tokens – number of total tokens seen so far

  • best_ckpt_iter – store iteration point of best ckpt

  • minimize_metric – minimize or maximize score

  • total_correct – number of correct tokens seen so far

is_best(score) bool[source]#
is_better(score: float, heap_queue: list) bool[source]#
load_state_dict(state_dict: Dict) None[source]#

Given a state_dict, this function reconstructs the state

state_dict() Dict[source]#

Returns a dictionary of values necessary to reconstruct stats

init_from_checkpoint(path: Path, reset_best_ckpt: bool = False, reset_scheduler: bool = False, reset_optimizer: bool = False, reset_iter_state: bool = False) None[source]#

Initialize the trainer from a given checkpoint file.

This checkpoint file contains not only model parameters, but also scheduler and optimizer states, see self._save_checkpoint.

Parameters:
  • path – path to checkpoint

  • reset_best_ckpt – reset tracking of the best checkpoint, use for domain adaptation with a new dev set.

  • reset_scheduler – reset the learning rate scheduler, and do not use the one stored in the checkpoint.

  • reset_optimizer – reset the optimizer, and do not use the one stored in the checkpoint.

  • reset_iter_state – reset the sampler’s internal state and do not use the one stored in the checkpoint.

init_layers(path: Path, layer: str) None[source]#

Initialize encoder decoder layers from a given checkpoint file.

Parameters:
  • path – path to checkpoint

  • layer – layer name; ‘encoder’ or ‘decoder’ expected

train_and_validate(train_data: Dataset, valid_data: Dataset) None[source]#

Train the model and validate it from time to time on the validation set.

Parameters:
  • train_data – training data

  • valid_data – validation data

joeynmt.training.train(rank: int, world_size: int, cfg: Dict, skip_test: bool = False) None[source]#

Main training function. After training, also test on test data if given.

Parameters:
  • rank – ddp local rank

  • world_size – ddp world size

  • cfg – configuration dict

  • skip_test – whether a test should be run or not after training

joeynmt.vocabulary module#

Vocabulary module

class joeynmt.vocabulary.Vocabulary(tokens: List[str], cfg: SimpleNamespace)[source]#

Bases: object

Vocabulary represents mapping between tokens and indices.

add_tokens(tokens: List[str]) None[source]#

Add list of tokens to vocabulary

Parameters:

tokens – list of tokens to add to the vocabulary

arrays_to_sentences(arrays: ndarray, cut_at_eos: bool = True, skip_pad: bool = True) List[List[str]][source]#

Convert multiple arrays containing sequences of token IDs to their sentences, optionally cutting them off at the end-of-sequence token.

Parameters:
  • arrays – 2D array containing indices

  • cut_at_eos – cut the decoded sentences at the first <eos>

  • skip_pad – skip generated <pad> tokens

Returns:

list of list of strings (tokens)

is_unk(token: str) bool[source]#

Check whether a token is covered by the vocabulary

Parameters:

token

Returns:

True if covered, False otherwise

log_vocab(k: int) str[source]#

Return the first k vocabulary entries (as a string).

lookup(token: str) int[source]#

Look up the encoding dictionary (needed for multiprocessing).

Parameters:

token – surface str

Returns:

token id

sentences_to_ids(sentences: List[List[str]], bos: bool = True, eos: bool = True) Tuple[List[List[int]], List[int], List[int]][source]#

Encode sentences to indices and pad sequences to the maximum length of the sentences given

Parameters:
  • sentences – list of tokenized sentences

  • bos – whether to add <bos>

  • eos – whether to add <eos>

Returns:

  • padded ids

  • original lengths before padding

  • prompt_mask

to_file(file: Path) None[source]#

Save the vocabulary to a file, writing the token with index i on line i.

Parameters:

file – path to file where the vocabulary is written

joeynmt.vocabulary.build_vocab(cfg: Dict, dataset: BaseDataset = None, model_dir: Path = None) Tuple[Vocabulary, Vocabulary][source]#
joeynmt.vocabulary.sort_and_cut(counter: Counter, max_size: int = 9223372036854775807, min_freq: int = -1) List[str][source]#

Cut counter to most frequent, sorted numerically and alphabetically.

Parameters:
  • counter – flattened token list in a Counter object

  • max_size – maximum size of vocabulary

  • min_freq – minimum frequency for an item to be included

Returns:

list of valid tokens

joeynmt.loss module#

Module to implement training loss

class joeynmt.loss.XentLoss(pad_index: int, smoothing: float = 0.0)[source]#

Bases: Module

Cross-Entropy Loss with optional label smoothing

forward(log_probs: Tensor, **kwargs) Tensor[source]#

Compute the cross-entropy between logits and targets.

If label smoothing is used, target distributions are not one-hot, but “1-smoothing” for the correct target token and the rest of the probability mass is uniformly spread across the other tokens.

Parameters:

log_probs – log probabilities as predicted by model

Returns:

the cross-entropy loss
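
A hedged sketch; the shapes are illustrative, and it assumes the gold target indices are passed as the keyword argument trg, which is how the model's loss computation supplies them.

import torch
from joeynmt.loss import XentLoss

criterion = XentLoss(pad_index=1, smoothing=0.1)
log_probs = torch.log_softmax(torch.randn(2, 6, 100), dim=-1)  # (batch, trg_length, vocab)
trg = torch.randint(0, 100, (2, 6))                            # gold target indices
loss = criterion(log_probs, trg=trg)                           # scalar training loss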

joeynmt.transformer_layers module#

Transformer layers

class joeynmt.transformer_layers.MultiHeadedAttention(num_heads: int, size: int, dropout: float = 0.1)[source]#

Bases: Module

Multi-Head Attention module from “Attention is All You Need”

Implementation modified from OpenNMT-py. https://github.com/OpenNMT/OpenNMT-py

forward(k: Tensor, v: Tensor, q: Tensor, mask: Tensor | None = None, return_weights: bool | None = None)[source]#

Computes multi-headed attention.

Parameters:
  • k – keys [batch_size, seq_len, hidden_size]

  • v – values [batch_size, seq_len, hidden_size]

  • q – query [batch_size, seq_len, hidden_size]

  • mask – optional mask [batch_size, 1, seq_len]

  • return_weights – whether to return the attention weights, averaged over heads.

Returns:

  • output [batch_size, query_len, hidden_size]

  • attention_weights [batch_size, query_len, key_len]
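
A hedged self-attention sketch; the sizes are illustrative, keys, values and queries are the same tensor here, and the return value is treated as a single output tensor, as when return_weights is left unset.

import torch
from joeynmt.transformer_layers import MultiHeadedAttention

mha = MultiHeadedAttention(num_heads=8, size=512, dropout=0.1)
x = torch.randn(2, 10, 512)                    # (batch_size, seq_len, hidden_size)
mask = torch.ones(2, 1, 10, dtype=torch.bool)  # 1 = attendable position
output = mha(k=x, v=x, q=x, mask=mask)         # (batch_size, seq_len, hidden_size)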

class joeynmt.transformer_layers.PositionalEncoding(size: int = 0, max_len: int = 5000)[source]#

Bases: Module

Pre-compute position encodings (PE). In forward pass, this adds the position-encodings to the input for as many time steps as necessary.

Implementation based on OpenNMT-py. https://github.com/OpenNMT/OpenNMT-py

forward(emb: Tensor) Tensor[source]#

Embed inputs.

Parameters:

emb – (Tensor) sequence of word embedding vectors, shape (seq_len, batch_size, dim)

Returns:

positionally encoded word embeddings

class joeynmt.transformer_layers.PositionwiseFeedForward(input_size: int, ff_size: int, dropout: float = 0.1, alpha: float = 1.0, layer_norm: str = 'post', activation: str = 'relu')[source]#

Bases: Module

Position-wise feed-forward layer. Projects to ff_size and then back down to input_size.

forward(x: Tensor) Tensor[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class joeynmt.transformer_layers.TransformerDecoderLayer(size: int = 0, ff_size: int = 0, num_heads: int = 0, dropout: float = 0.1, alpha: float = 1.0, layer_norm: str = 'post', activation: str = 'relu')[source]#

Bases: Module

Transformer decoder layer.

Consists of self-attention, source-attention, and feed-forward.

forward(x: Tensor, memory: Tensor, src_mask: Tensor, trg_mask: Tensor, return_attention: bool = False, **kwargs) Tensor[source]#

Forward pass of a single Transformer decoder layer.

First applies target-target self-attention, dropout with residual connection (adding the input to the result), and layer norm.

Second computes source-target cross-attention, dropout with residual connection (adding the self-attention to the result), and layer norm.

Finally goes through a position-wise feed-forward layer.

Parameters:
  • x – inputs

  • memory – source representations

  • src_mask – source mask

  • trg_mask – target mask (so as not to condition on future steps)

  • return_attention – whether to return the attention weights

Returns:

  • output tensor

  • attention weights

class joeynmt.transformer_layers.TransformerEncoderLayer(size: int = 0, ff_size: int = 0, num_heads: int = 0, dropout: float = 0.1, alpha: float = 1.0, layer_norm: str = 'post', activation: str = 'relu')[source]#

Bases: Module

One Transformer encoder layer has a Multi-head attention layer plus a position-wise feed-forward layer.

forward(x: Tensor, mask: Tensor) Tensor[source]#

Forward pass for a single transformer encoder layer. First applies self attention, then dropout with residual connection (adding the input to the result), then layer norm, and then a position-wise feed-forward layer.

Parameters:
  • x – layer input

  • mask – input mask

Returns:

output tensor