API Documentation#
Module contents#
Submodules#
joeynmt.attention module#
Attention modules
- class joeynmt.attention.AttentionMechanism(*args, **kwargs)[source]#
Bases:
Module
Base attention class
- forward(*inputs)[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class joeynmt.attention.BahdanauAttention(hidden_size: int = 1, key_size: int = 1, query_size: int = 1)[source]#
Bases:
AttentionMechanism
Implements Bahdanau (MLP) attention
Section A.1.2 in https://arxiv.org/abs/1409.0473.
- compute_proj_keys(keys: Tensor) None [source]#
Compute the projection of the keys. This is efficient when pre-computed once before receiving individual queries.
- Parameters:
keys –
- Returns:
- compute_proj_query(query: Tensor)[source]#
Compute the projection of the query.
- Parameters:
query –
- Returns:
- forward(query: Tensor, mask: Tensor, values: Tensor) Tuple[Tensor, Tensor] [source]#
Bahdanau MLP attention forward pass.
- Parameters:
query – the item (decoder state) to compare with the keys/memory, shape (batch_size, 1, decoder.hidden_size)
mask – mask out key positions (0 in invalid positions, 1 elsewhere), shape (batch_size, 1, src_length)
values – values (encoder states), shape (batch_size, src_length, encoder.hidden_size)
- Returns:
context vector of shape (batch_size, 1, value_size),
attention probabilities of shape (batch_size, 1, src_length)
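A minimal usage sketch (tensor sizes are illustrative; as noted above, the key projections must be pre-computed before querying):

```python
import torch
from joeynmt.attention import BahdanauAttention

batch_size, src_length, enc_hidden, dec_hidden = 2, 7, 6, 5
attention = BahdanauAttention(
    hidden_size=dec_hidden, key_size=enc_hidden, query_size=dec_hidden)

values = torch.randn(batch_size, src_length, enc_hidden)       # encoder states
query = torch.randn(batch_size, 1, dec_hidden)                 # decoder state
mask = torch.ones(batch_size, 1, src_length, dtype=torch.bool) # 1 = valid

attention.compute_proj_keys(keys=values)  # pre-compute key projections once
context, att_probs = attention(query=query, mask=mask, values=values)
# context: (2, 1, 6); att_probs: (2, 1, 7), summing to 1 over src_length
```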
- class joeynmt.attention.LuongAttention(hidden_size: int = 1, key_size: int = 1)[source]#
Bases:
AttentionMechanism
Implements Luong (bilinear / multiplicative) attention.
Eq. 8 (“general”) in http://aclweb.org/anthology/D15-1166.
- compute_proj_keys(keys: Tensor) None [source]#
Compute the projection of the keys and assign them to self.proj_keys. This pre-computation is efficiently done for all keys before receiving individual queries.
- Parameters:
keys – shape (batch_size, src_length, encoder.hidden_size)
- forward(query: Tensor, mask: Tensor, values: Tensor) Tuple[Tensor, Tensor] [source]#
Luong (multiplicative / bilinear) attention forward pass. Computes context vectors and attention scores for a given query and all masked values and returns them.
- Parameters:
query – the item (decoder state) to compare with the keys/memory, shape (batch_size, 1, decoder.hidden_size)
mask – mask out key positions (0 in invalid positions, 1 elsewhere), shape (batch_size, 1, src_length)
values – values (encoder states), shape (batch_size, src_length, encoder.hidden_size)
- Returns:
context vector of shape (batch_size, 1, value_size),
attention probabilities of shape (batch_size, 1, src_length)
joeynmt.batch module#
Implementation of a mini-batch.
- class joeynmt.batch.Batch(src: Tensor, src_length: Tensor, src_prompt_mask: Tensor | None, trg: Tensor | None, trg_prompt_mask: Tensor | None, indices: Tensor, device: device, pad_index: int, eos_index: int, is_train: bool = True)[source]#
Bases:
object
Object for holding a batch of data with mask during training. Input is yielded from collate_fn() called by torch.utils.data.DataLoader.
- normalize(tensor: Tensor, normalization: str = 'none', n_gpu: int = 1, n_accumulation: int = 1) Tensor [source]#
Normalizes a batch tensor (i.e. the loss): takes the sum over multiple GPUs, divides by nseqs or ntokens, then divides by n_gpu and by n_accumulation.
- Parameters:
tensor – (Tensor) tensor to normalize, i.e. batch loss
normalization – (str) one of {batch, tokens, none}
n_gpu – (int) the number of gpus
n_accumulation – (int) the number of gradient accumulation steps
- Returns:
normalized tensor
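A plain-Python sketch of the normalization arithmetic described above (names are illustrative, not the actual implementation):

```python
def normalize_sketch(loss_sum: float, normalization: str = "none",
                     nseqs: int = 1, ntokens: int = 1,
                     n_gpu: int = 1, n_accumulation: int = 1) -> float:
    # divide by the number of sequences or tokens in the batch
    if normalization == "batch":
        loss_sum /= nseqs
    elif normalization == "tokens":
        loss_sum /= ntokens
    # then account for data-parallel replicas and gradient accumulation
    return loss_sum / max(n_gpu, 1) / n_accumulation
```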
joeynmt.builders module#
Collection of builder functions
- class joeynmt.builders.BaseScheduler(optimizer: Optimizer)[source]#
Bases:
object
Base LR scheduler that decays at every “step”
- class joeynmt.builders.NoamScheduler(hidden_size: int, optimizer: Optimizer, factor: float = 1.0, warmup: int = 4000)[source]#
Bases:
BaseScheduler
The Noam learning rate scheduler used in “Attention is all you need”. See Eq. 3 in https://arxiv.org/abs/1706.03762
- class joeynmt.builders.WarmupExponentialDecayScheduler(optimizer: Optimizer, peak_rate: float = 0.001, decay_length: int = 10000, warmup: int = 4000, decay_rate: float = 0.5, min_rate: float = 1e-05)[source]#
Bases:
BaseScheduler
A learning rate scheduler similar to Noam, but modified: it keeps the warmup period but makes the decay rate tunable. The decay is exponential, down to a given minimum rate.
- class joeynmt.builders.WarmupInverseSquareRootScheduler(optimizer: Optimizer, peak_rate: float = 0.001, warmup: int = 10000, min_rate: float = 1e-05)[source]#
Bases:
BaseScheduler
Decay the LR based on the inverse square root of the update number. In the warmup phase, we linearly increase the learning rate. After warmup, we decrease the learning rate as follows:
decay_factor = peak_rate * sqrt(warmup)  # constant value
lr = decay_factor / sqrt(step)
cf.) https://github.com/pytorch/fairseq/blob/main/fairseq/optim/lr_scheduler/inverse_square_root_schedule.py
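A sketch of the resulting learning-rate curve, following the two formulas above (the linear warmup shape is an assumption based on the description):

```python
import math

def inverse_sqrt_lr(step: int, peak_rate: float = 1e-3,
                    warmup: int = 10_000, min_rate: float = 1e-5) -> float:
    if step < warmup:                                 # linear warmup to peak
        return peak_rate * step / warmup
    decay_factor = peak_rate * math.sqrt(warmup)      # constant value
    return max(min_rate, decay_factor / math.sqrt(step))

# the rate peaks at step == warmup, then decays with 1/sqrt(step)
assert abs(inverse_sqrt_lr(10_000) - 1e-3) < 1e-12
assert inverse_sqrt_lr(40_000) == 5e-4
```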
- joeynmt.builders.build_activation(activation: str = 'relu') Callable [source]#
Returns the activation function
- joeynmt.builders.build_gradient_clipper(cfg: Dict) Callable | None [source]#
Define the function for gradient clipping as specified in configuration. If not specified, returns None.
- Current options:
- “clip_grad_val”: clip the gradients if they exceed this value,
see torch.nn.utils.clip_grad_value_
- “clip_grad_norm”: clip the gradients if their norm exceeds this value,
see torch.nn.utils.clip_grad_norm_
- Parameters:
cfg – dictionary with training configurations
- Returns:
clipping function (in-place) or None if no gradient clipping
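A usage sketch (the returned callable mirrors the corresponding torch.nn.utils function with the threshold already bound; treat the exact call convention as an assumption):

```python
import torch.nn as nn
from joeynmt.builders import build_gradient_clipper

model = nn.Linear(8, 8)  # stand-in for a real translation model
clip_fn = build_gradient_clipper({"clip_grad_norm": 1.0})

# inside the training loop, after loss.backward():
if clip_fn is not None:
    clip_fn(model.parameters())  # clips gradients in place
```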
- joeynmt.builders.build_optimizer(cfg: Dict, parameters: Generator) Optimizer [source]#
Create an optimizer for the given parameters as specified in config.
Except for the weight decay and initial learning rate, default optimizer settings are used.
- Currently supported configuration settings for “optimizer”:
“sgd” (default): see torch.optim.SGD
“adam”: see torch.optim.Adam
“adamw”: see torch.optim.AdamW
“adagrad”: see torch.optim.Adagrad
“adadelta”: see torch.optim.Adadelta
“rmsprop”: see torch.optim.RMSprop
The initial learning rate is set according to “learning_rate” in the config. The weight decay is set according to “weight_decay” in the config. If they are not specified, the initial learning rate is set to 3.0e-4, the weight decay to 0.
Note that the scheduler state is saved in the checkpoint, so if you load a model for further training you have to use the same type of scheduler.
- Parameters:
cfg – configuration dictionary
parameters –
- Returns:
optimizer
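A usage sketch (the config keys follow the description above; the values are illustrative):

```python
import torch.nn as nn
from joeynmt.builders import build_optimizer

model = nn.Linear(8, 8)  # stand-in for a real translation model
cfg = {"optimizer": "adamw", "learning_rate": 3.0e-4, "weight_decay": 0.01}
optimizer = build_optimizer(cfg=cfg, parameters=model.parameters())
```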
- joeynmt.builders.build_scheduler(cfg: Dict, optimizer: Optimizer, scheduler_mode: str, hidden_size: int = 0)[source]#
Create a learning rate scheduler if specified in config and determine when a scheduler step should be executed.
- Current options:
“plateau”: see torch.optim.lr_scheduler.ReduceLROnPlateau
“decaying”: see torch.optim.lr_scheduler.StepLR
“exponential”: see torch.optim.lr_scheduler.ExponentialLR
“noam”: see joeynmt.builders.NoamScheduler
“warmupexponentialdecay”: see joeynmt.builders.WarmupExponentialDecayScheduler
“warmupinversesquareroot”: see joeynmt.builders.WarmupInverseSquareRootScheduler
If no scheduler is specified, returns (None, None) which will result in a constant learning rate.
- Parameters:
cfg – training configuration
optimizer – optimizer for the scheduler, determines the set of parameters which the scheduler sets the learning rate for
scheduler_mode – “min” or “max”, depending on whether the validation score should be minimized or maximized. Only relevant for “plateau”.
hidden_size – encoder hidden size (required for NoamScheduler)
- Returns:
scheduler: scheduler object,
scheduler_step_at: either “validation”, “epoch”, “step” or “none”
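A sketch of how the returned pair is used, continuing from the optimizer sketch above (the config keys mirror the TrainConfig fields but are illustrative):

```python
from joeynmt.builders import build_scheduler

cfg = {"scheduling": "plateau", "patience": 5, "decrease_factor": 0.5}
scheduler, scheduler_step_at = build_scheduler(
    cfg=cfg, optimizer=optimizer, scheduler_mode="max")  # maximize val. score

# in the training loop, step the scheduler where it asks to be stepped:
# "validation" -> scheduler.step(validation_score) after each validation run,
# "epoch" / "step" -> scheduler.step() after each epoch / each update.
```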
joeynmt.config module#
Module for configuration
This can only be a temporary solution. TODO: Consider better configuration and validation cf. https://github.com/joeynmt/joeynmt/issues/196
- class joeynmt.config.BaseConfig(name, joeynmt_version, model_dir, device, n_gpu, num_workers, autocast, seed, train, test, data, model)#
Bases:
tuple
- autocast: Dict#
Alias for field number 6
- data: Dict#
Alias for field number 10
- device: device#
Alias for field number 3
- joeynmt_version: str | None#
Alias for field number 1
- model: Dict#
Alias for field number 11
- model_dir: Path#
Alias for field number 2
- n_gpu: int#
Alias for field number 4
- name: str#
Alias for field number 0
- num_workers: int#
Alias for field number 5
- seed: int#
Alias for field number 7
- test: TestConfig#
Alias for field number 9
- train: TrainConfig#
Alias for field number 8
- exception joeynmt.config.ConfigurationError[source]#
Bases:
Exception
Custom exception for misspecifications of configuration
- class joeynmt.config.TestConfig(load_model, batch_size, batch_type, max_output_length, min_output_length, eval_metrics, sacrebleu_cfg, beam_size, beam_alpha, n_best, return_attention, return_prob, generate_unk, repetition_penalty, no_repeat_ngram_size)#
Bases:
tuple
- batch_size: int#
Alias for field number 1
- batch_type: str#
Alias for field number 2
- beam_alpha: int#
Alias for field number 8
- beam_size: int#
Alias for field number 7
- eval_metrics: List[str]#
Alias for field number 5
- generate_unk: bool#
Alias for field number 12
- load_model: Path | None#
Alias for field number 0
- max_output_length: int#
Alias for field number 3
- min_output_length: int#
Alias for field number 4
- n_best: int#
Alias for field number 9
- no_repeat_ngram_size: int#
Alias for field number 14
- repetition_penalty: float#
Alias for field number 13
- return_attention: bool#
Alias for field number 10
- return_prob: str#
Alias for field number 11
- sacrebleu_cfg: Dict | None#
Alias for field number 6
- class joeynmt.config.TrainConfig(load_model, load_encoder, load_decoder, loss, normalization, label_smoothing, optimizer, adam_betas, learning_rate, learning_rate_min, learning_rate_factor, learning_rate_warmup, scheduling, patience, decrease_factor, weight_decay, clip_grad_norm, clip_grad_val, keep_best_ckpts, logging_freq, validation_freq, print_valid_sents, early_stopping_metric, minimize_metric, shuffle, epochs, max_updates, batch_size, batch_type, batch_multiplier, reset_best_ckpt, reset_scheduler, reset_optimizer, reset_iter_state)#
Bases:
tuple
- adam_betas: List[float]#
Alias for field number 7
- batch_multiplier: int#
Alias for field number 29
- batch_size: int#
Alias for field number 27
- batch_type: str#
Alias for field number 28
- clip_grad_norm: float | None#
Alias for field number 16
- clip_grad_val: float | None#
Alias for field number 17
- decrease_factor: float#
Alias for field number 14
- early_stopping_metric: str#
Alias for field number 22
- epochs: int#
Alias for field number 25
- keep_best_ckpts: int#
Alias for field number 18
- label_smoothing: float#
Alias for field number 5
- learning_rate: float#
Alias for field number 8
- learning_rate_factor: int#
Alias for field number 10
- learning_rate_min: float#
Alias for field number 9
- learning_rate_warmup: int#
Alias for field number 11
- load_decoder: Path | None#
Alias for field number 2
- load_encoder: Path | None#
Alias for field number 1
- load_model: Path | None#
Alias for field number 0
- logging_freq: int#
Alias for field number 19
- loss: str#
Alias for field number 3
- max_updates: int#
Alias for field number 26
- minimize_metric: bool#
Alias for field number 23
- normalization: str#
Alias for field number 4
- optimizer: str#
Alias for field number 6
- patience: int#
Alias for field number 13
- print_valid_sents: List[int]#
Alias for field number 21
- reset_best_ckpt: bool#
Alias for field number 30
- reset_iter_state: bool#
Alias for field number 33
- reset_optimizer: bool#
Alias for field number 32
- reset_scheduler: bool#
Alias for field number 31
- scheduling: str | None#
Alias for field number 12
- shuffle: bool#
Alias for field number 24
- validation_freq: int#
Alias for field number 20
- weight_decay: float#
Alias for field number 15
- joeynmt.config.load_config(cfg_file: str = 'configs/default.yaml') Dict [source]#
Loads and parses a YAML configuration file.
- Parameters:
cfg_file – path to YAML configuration file
- Returns:
configuration dictionary
- joeynmt.config.log_config(cfg: Dict, prefix: str = 'cfg') None [source]#
Print configuration to console log.
- Parameters:
cfg – configuration to log
prefix – prefix for logging
- joeynmt.config.parse_global_args(cfg: Dict = None, rank: int = 0, mode: str = 'train') BaseConfig [source]#
Parse and validate global args
- Parameters:
cfg – config specified in yaml file
rank –
mode –
- joeynmt.config.parse_test_args(cfg: Dict = None, mode: str = 'test') TestConfig [source]#
Parse and validate test args
- Parameters:
cfg – testing section in config yaml
mode –
- joeynmt.config.parse_train_args(cfg: Dict = None, mode: str = 'train') TrainConfig [source]#
Parse and validate train args
- Parameters:
cfg – training section in config yaml
mode –
- joeynmt.config.set_validation_args(args: TestConfig) TestConfig [source]#
Config for validation
- Parameters:
args – testing section in config yaml
joeynmt.data module#
Data module
- joeynmt.data.load_data(cfg: Dict, datasets: list = None) Tuple[Vocabulary, Vocabulary, BaseDataset | None, BaseDataset | None, BaseDataset | None] [source]#
Load train, dev and optionally test data as specified in configuration. Vocabularies are created from the training set with a limit of voc_limit tokens and a minimum token frequency of voc_min_freq (specified in the configuration dictionary).
The training data is filtered to include sentences up to max_length on source and target side.
If you set random_{train|dev}_subset, a random selection of this size is used from the {train|development} set instead of the full {train|development} set.
- Parameters:
cfg – configuration dictionary for data (“data” part of config file)
datasets – list of dataset names to load
- Returns:
src_vocab: source vocabulary
trg_vocab: target vocabulary
train_data: training dataset
dev_data: development dataset
test_data: test dataset if given, otherwise None
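Putting load_config and load_data together (a sketch; the config path is hypothetical):

```python
from joeynmt.config import load_config
from joeynmt.data import load_data

cfg = load_config("configs/small.yaml")  # hypothetical config file
src_vocab, trg_vocab, train_data, dev_data, test_data = load_data(
    cfg["data"], datasets=["train", "dev", "test"])
# vocabularies are built from the training set, as described above
```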
joeynmt.datasets module#
Dataset module
- class joeynmt.datasets.BaseDataset(path: str, src_lang: str, trg_lang: str, split: str = 'train', has_trg: bool = False, has_prompt: Dict[str, bool] = None, tokenizer: Dict[str, BasicTokenizer] = None, sequence_encoder: Dict[str, Callable] = None, random_subset: int = -1)[source]#
Bases:
Dataset
BaseDataset which loads and looks up data; holds pointers to tokenizers and encoding functions.
- Parameters:
path – path to data directory
src_lang – source language code, e.g. en
trg_lang – target language code, e.g. de
has_trg – bool indicator if trg exists
has_prompt – bool indicator if prompt exists
split – data split, one of {train, dev, test}
tokenizer – tokenizer objects
sequence_encoder – encoding functions
- collate_fn(batch: List[Tuple], pad_index: int, eos_index: int, device: device = device(type='cpu')) Batch [source]#
Custom collate function. See https://pytorch.org/docs/stable/data.html#dataloader-collate-fn for details. Please override the Batch class here (not in TrainManager).
- Parameters:
batch –
pad_index –
eos_index –
device –
- Returns:
joeynmt batch object
- get_item(idx: int, lang: str, is_train: bool = None) List[str] [source]#
Seek one src/trg item of the given index. Tokenization is applied here; length-filtering, BPE-dropout etc. are also triggered if self.split == “train”.
- get_list(lang: str, tokenized: bool = False, subsampled: bool = True) List[str] | List[List[str]] [source]#
get data column-wise.
- load_data(path: Path, **kwargs) Any [source]#
Load data. Preprocessing (lowercasing etc.) is applied here.
- make_iter(batch_size: int, batch_type: str = 'sentence', seed: int = 42, shuffle: bool = False, num_workers: int = 0, pad_index: int = 1, eos_index: int = 3, device: device = device(type='cpu'), generator_state: Tensor = None) DataLoader [source]#
Returns a torch DataLoader for a torch Dataset. (no bucketing)
- Parameters:
batch_size – size of the batches the iterator prepares
batch_type – measure batch size by sentence count or by token count
seed – random seed for shuffling
shuffle – whether to shuffle the order of sequences before each epoch (for testing, no effect even if set to True; generator is still used for random subsampling, but not for permutation!)
num_workers – number of cpus for multiprocessing
pad_index –
eos_index –
device –
generator_state –
- Returns:
torch DataLoader
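For instance (a sketch, continuing from the load_data sketch above; the special-token indices are the defaults shown in the signature):

```python
loader = train_data.make_iter(
    batch_size=4096, batch_type="token",  # measure batch size in tokens
    shuffle=True, seed=42,
    pad_index=1, eos_index=3)

for batch in loader:  # yields joeynmt.batch.Batch objects (see collate_fn)
    print(batch.src.shape)  # padded source token indices
    break
```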
- property src: List[str]#
get detokenized preprocessed data in src language.
- property trg: List[str]#
get detokenized preprocessed data in trg language.
- class joeynmt.datasets.BaseHuggingfaceDataset(path: str, src_lang: str, trg_lang: str, has_trg: bool = True, has_prompt: Dict[str, bool] = None, tokenizer: Dict[str, BasicTokenizer] = None, sequence_encoder: Dict[str, Callable] = None, random_subset: int = -1, **kwargs)[source]#
Bases:
BaseDataset
Wrapper for Huggingface’s dataset object cf.) https://huggingface.co/docs/datasets
- COLUMN_NAME = 'sentence'#
- get_list(lang: str, tokenized: bool = False, subsampled: bool = True) List[str] | List[List[str]] [source]#
get data column-wise.
- class joeynmt.datasets.HuggingfaceTranslationDataset(path: str, src_lang: str, trg_lang: str, has_trg: bool = True, has_prompt: Dict[str, bool] = None, tokenizer: Dict[str, BasicTokenizer] = None, sequence_encoder: Dict[str, Callable] = None, random_subset: int = -1, **kwargs)[source]#
Bases:
BaseHuggingfaceDataset
Wrapper for Huggingface’s datasets.features.Translation class cf.) https://github.com/huggingface/datasets/blob/master/src/datasets/features/translation.py
- COLUMN_NAME = 'translation'#
- class joeynmt.datasets.PlaintextDataset(path: str, src_lang: str, trg_lang: str, split: str = 'train', has_trg: bool = False, has_prompt: Dict[str, bool] = None, tokenizer: Dict[str, BasicTokenizer] = None, sequence_encoder: Dict[str, Callable] = None, random_subset: int = -1, **kwargs)[source]#
Bases:
BaseDataset
PlaintextDataset which stores plain text pairs; used for text file data in the format of one sentence per line.
- get_list(lang: str, tokenized: bool = False, subsampled: bool = True) List[str] | List[List[str]] [source]#
Return list of preprocessed sentences in the given language. (not length-filtered, no bpe-dropout)
- class joeynmt.datasets.SentenceBatchSampler(sampler: Sampler, batch_size: int, drop_last: bool, seed: int)[source]#
Bases:
BatchSampler
Wraps another sampler to yield a mini-batch of indices based on the number of instances. An instance longer than dataset.max_len will be filtered out.
- Parameters:
sampler – Base sampler. Can be any iterable object
batch_size – Size of mini-batch.
drop_last – If True, the sampler will drop the last batch if its size would be less than batch_size
- property num_samples: int#
Returns the number of samples in the dataset; this may change during sampling.
Note: len(dataset) won’t change during sampling. Use len(dataset) to retrieve the original dataset length.
- class joeynmt.datasets.StreamDataset(path: str, src_lang: str, trg_lang: str, split: str = 'test', has_trg: bool = False, has_prompt: Dict[str, bool] = None, tokenizer: Dict[str, BasicTokenizer] = None, sequence_encoder: Dict[str, Callable] = None, random_subset: int = -1, **kwargs)[source]#
Bases:
BaseDataset
StreamDataset which interacts with stream inputs; called by the translate() function in prediction.py.
- class joeynmt.datasets.TokenBatchSampler(sampler: Sampler, batch_size: int, drop_last: bool, seed: int)[source]#
Bases:
SentenceBatchSampler
Wraps another sampler to yield a mini-batch of indices based on the number of tokens (incl. padding). An instance longer than dataset.max_len or shorter than dataset.min_len will be filtered out. No bucketing is implemented.
Warning
In DDP, we shouldn’t use TokenBatchSampler for prediction, because we cannot ensure that the data points will be distributed evenly across devices. ddp_merge() (dist.all_gather()) called in predict() can get stuck.
- Parameters:
sampler – Base sampler. Can be any iterable object
batch_size – Size of mini-batch.
drop_last – If True, the sampler will drop the last batch if its size would be less than batch_size
- class joeynmt.datasets.TsvDataset(path: str, src_lang: str, trg_lang: str, split: str = 'train', has_trg: bool = False, has_prompt: Dict[str, bool] = None, tokenizer: Dict[str, BasicTokenizer] = None, sequence_encoder: Dict[str, Callable] = None, random_subset: int = -1, **kwargs)[source]#
Bases:
BaseDataset
TsvDataset which handles data in TSV format. The file name should be specified without the .tsv extension. Needs src_lang and trg_lang (e.g. en, de) in the header. See test/data/toy/dev.tsv.
- get_list(lang: str, tokenized: bool = False, subsampled: bool = True) List[str] | List[List[str]] [source]#
get data column-wise.
- joeynmt.datasets.build_dataset(dataset_type: str, path: str, src_lang: str, trg_lang: str, split: str, tokenizer: Dict = None, sequence_encoder: Dict = None, has_prompt: Dict = None, random_subset: int = -1, **kwargs)[source]#
Builds a dataset.
- Parameters:
dataset_type – (str) one of {plain, tsv, stream, huggingface}
path – (str) either a local file name or dataset name to download from remote
src_lang – (str) language code for source
trg_lang – (str) language code for target
split – (str) one of {train, dev, test}
tokenizer – tokenizer objects for both source and target
sequence_encoder – encoding functions for both source and target
has_prompt – prompt indicators
random_subset – (int) size of the random subset; -1 means no subsampling
- Returns:
loaded Dataset
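For example (a sketch; the path follows the TsvDataset note above and is illustrative; tokenizer and sequence_encoder are omitted for brevity, whereas real use passes the dicts built by build_tokenizer):

```python
from joeynmt.datasets import build_dataset

dev_data = build_dataset(
    dataset_type="tsv", path="test/data/toy/dev",  # no .tsv extension
    src_lang="de", trg_lang="en", split="dev")
```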
joeynmt.decoders module#
Various decoders
- class joeynmt.decoders.Decoder(*args, **kwargs)[source]#
Bases:
Module
Base decoder class
- property output_size#
Return the output size (size of the target vocabulary)
- Returns:
- class joeynmt.decoders.RecurrentDecoder(rnn_type: str = 'gru', emb_size: int = 0, hidden_size: int = 0, encoder: Encoder = None, attention: str = 'bahdanau', num_layers: int = 1, vocab_size: int = 0, dropout: float = 0.0, emb_dropout: float = 0.0, hidden_dropout: float = 0.0, init_hidden: str = 'bridge', input_feeding: bool = True, freeze: bool = False, **kwargs)[source]#
Bases:
Decoder
A conditional RNN decoder with attention.
- forward(trg_embed: Tensor, encoder_output: Tensor, encoder_hidden: Tensor, src_mask: Tensor, unroll_steps: int, hidden: Tensor = None, prev_att_vector: Tensor = None, **kwargs) Tuple[Tensor, Tensor, Tensor, Tensor, Tensor] [source]#
Unroll the decoder one step at a time for unroll_steps steps. For every step, the _forward_step function is called internally.
During training, the target inputs (trg_embed) are already known for the full sequence, so the full unroll is done. In this case, hidden and prev_att_vector are None.
For inference, this function is called one step at a time, since the embedded targets are the predictions from the previous time step. In this case, hidden and prev_att_vector are fed from the output of the previous call of this function (from the 2nd step on).
src_mask is needed to mask out the areas of the encoder states that should not receive any attention, which is everything after the first <eos>.
The encoder_output are the hidden states from the encoder and are used as context for the attention.
The encoder_hidden is the last encoder hidden state that is used to initialize the first hidden decoder state (when self.init_hidden_option is “bridge” or “last”).
- Parameters:
trg_embed – embedded target inputs, shape (batch_size, trg_length, embed_size)
encoder_output – hidden states from the encoder, shape (batch_size, src_length, encoder.output_size)
encoder_hidden – last state from the encoder, shape (batch_size, encoder.output_size)
src_mask – mask for src states: 0s for padded areas, 1s for the rest, shape (batch_size, 1, src_length)
unroll_steps – number of steps to unroll the decoder RNN
hidden – previous decoder hidden state, if not given it’s initialized as in self.init_hidden, shape (batch_size, num_layers, hidden_size)
prev_att_vector – previous attentional vector, if not given it’s initialized with zeros, shape (batch_size, 1, hidden_size)
- Returns:
outputs: shape (batch_size, unroll_steps, vocab_size),
hidden: last hidden state (batch_size, num_layers, hidden_size),
- att_probs: attention probabilities
with shape (batch_size, unroll_steps, src_length),
- att_vectors: attentional vectors
with shape (batch_size, unroll_steps, hidden_size)
- class joeynmt.decoders.TransformerDecoder(num_layers: int = 4, num_heads: int = 8, hidden_size: int = 512, ff_size: int = 2048, dropout: float = 0.1, emb_dropout: float = 0.1, vocab_size: int = 1, freeze: bool = False, **kwargs)[source]#
Bases:
Decoder
A transformer decoder with N masked layers. Decoder layers are masked so that an attention head cannot see the future.
- forward(trg_embed: Tensor, encoder_output: Tensor, encoder_hidden: Tensor, src_mask: Tensor, unroll_steps: int, hidden: Tensor, trg_mask: Tensor, **kwargs)[source]#
Transformer decoder forward pass.
- Parameters:
trg_embed – embedded targets
encoder_output – source representations
encoder_hidden – unused
src_mask –
unroll_steps – unused
hidden – unused
trg_mask – to mask out target paddings. Note that a subsequent mask is applied here.
kwargs –
- Returns:
decoder_output: shape (batch_size, seq_len, vocab_size)
decoder_hidden: shape (batch_size, seq_len, emb_size)
att_probs: shape (batch_size, trg_length, src_length),
None
joeynmt.embeddings module#
Embedding module
- class joeynmt.embeddings.Embeddings(embedding_dim: int = 64, scale: bool = False, vocab_size: int = 0, padding_idx: int = 1, freeze: bool = False, **kwargs)[source]#
Bases:
Module
Simple embeddings class
- forward(x: Tensor) Tensor [source]#
Perform lookup for input x in the embedding table.
- Parameters:
x – index in the vocabulary
- Returns:
embedded representation for x
- load_from_file(embed_path: Path, vocab: Vocabulary) None [source]#
Load pretrained embedding weights from text file.
First line is expected to contain vocabulary size and dimension. The dimension has to match the model’s specified embedding size; the vocabulary size is used for logging only.
Each line should contain word and embedding weights separated by spaces.
The pretrained vocabulary items that are not part of joeynmt’s vocabulary will be ignored (not loaded from the file).
The initialization (specified in config[“model”][“embed_initializer”]) of joeynmt’s vocabulary items that are not part of the pretrained vocabulary will be kept (not overwritten in this func).
This function should be called after initialization!
- Example:
2 5
the -0.0230 -0.0264 0.0287 0.0171 0.1403
at -0.0395 -0.1286 0.0275 0.0254 -0.0932
- Parameters:
embed_path – embedding weights text file
vocab – Vocabulary object
joeynmt.encoders module#
Various encoders
- class joeynmt.encoders.Encoder(*args, **kwargs)[source]#
Bases:
Module
Base encoder class
- property output_size#
Return the output size
- Returns:
- class joeynmt.encoders.RecurrentEncoder(rnn_type: str = 'gru', hidden_size: int = 1, emb_size: int = 1, num_layers: int = 1, dropout: float = 0.0, emb_dropout: float = 0.0, bidirectional: bool = True, freeze: bool = False, **kwargs)[source]#
Bases:
Encoder
Encodes a sequence of word embeddings
- forward(src_embed: Tensor, src_length: Tensor, mask: Tensor, **kwargs) Tuple[Tensor, Tensor, Tensor] [source]#
Applies a bidirectional RNN to a sequence of embeddings x. The input mini-batch x needs to be sorted by src length. x and mask should have the same dimensions [batch, time, dim].
- Parameters:
src_embed – embedded src inputs, shape (batch_size, src_len, embed_size)
src_length – length of src inputs (counting tokens before padding), shape (batch_size)
mask – indicates padding areas (zeros where padding), shape (batch_size, src_len, embed_size)
kwargs –
- Returns:
- output: hidden states with
shape (batch_size, max_length, directions*hidden),
- hidden_concat: last hidden state with
shape (batch_size, directions*hidden)
- class joeynmt.encoders.TransformerEncoder(hidden_size: int = 512, ff_size: int = 2048, num_layers: int = 8, num_heads: int = 4, dropout: float = 0.1, emb_dropout: float = 0.1, freeze: bool = False, **kwargs)[source]#
Bases:
Encoder
Transformer Encoder
- forward(src_embed: Tensor, src_length: Tensor, mask: Tensor = None, **kwargs) Tuple[Tensor, Tensor] [source]#
Pass the input (and mask) through each layer in turn. Applies a Transformer encoder to a sequence of embeddings x. The input mini-batch x needs to be sorted by src length. x and mask should have the same dimensions [batch, time, dim].
- Parameters:
src_embed – embedded src inputs, shape (batch_size, src_len, embed_size)
src_length – length of src inputs (counting tokens before padding), shape (batch_size)
mask – indicates padding areas (zeros where padding), shape (batch_size, 1, src_len)
kwargs –
- Returns:
output: hidden states with shape (batch_size, max_length, hidden)
None
joeynmt.helpers module#
Collection of helper functions
- joeynmt.helpers.adjust_mask_size(mask: Tensor, batch_size: int, hyp_len: int) Tensor [source]#
Adjust mask size along dim=1. Used for forced decoding (trg prompting).
- Parameters:
mask – trg prompt mask in shape (batch_size, hyp_len)
batch_size –
hyp_len –
- joeynmt.helpers.check_version(cfg_version: str = None) str [source]#
Check joeynmt version
- Parameters:
cfg_version – version number specified in config
- Returns:
package version number string
- joeynmt.helpers.clones(module: Module, n: int) ModuleList [source]#
Produce N identical layers. Transformer helper function.
- Parameters:
module – the module to clone
n – clone this many times
- Returns:
cloned modules
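A sketch of what this standard Transformer helper does (deep copies, so each layer gets its own independent parameters):

```python
import copy
from torch import nn

def clones_sketch(module: nn.Module, n: int) -> nn.ModuleList:
    # deep-copy the module n times into a ModuleList
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])

layers = clones_sketch(nn.Linear(512, 512), 6)
assert len(layers) == 6
```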
- joeynmt.helpers.delete_ckpt(to_delete: Path) None [source]#
Delete checkpoint
- Parameters:
to_delete – checkpoint file to be deleted
- joeynmt.helpers.expand_reverse_index(reverse_index: List[int], n_best: int = 1) List[int] [source]#
Expand resort_reverse_index for n_best prediction
ex. 1) reverse_index = [1, 0, 2] and n_best = 2, then this will return [2, 3, 0, 1, 4, 5].
ex. 2) reverse_index = [1, 0, 2] and n_best = 3, then this will return [3, 4, 5, 0, 1, 2, 6, 7, 8]
- Parameters:
reverse_index – reverse_index returned from batch.sort_by_src_length()
n_best –
- Returns:
expanded sort_reverse_index
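A pure-Python sketch reproducing the two examples above (each original position expands to its n_best consecutive hypothesis slots):

```python
from typing import List

def expand_reverse_index_sketch(reverse_index: List[int],
                                n_best: int = 1) -> List[int]:
    expanded = []
    for idx in reverse_index:
        # indices of the n_best hypotheses belonging to original position idx
        expanded.extend(range(idx * n_best, (idx + 1) * n_best))
    return expanded

assert expand_reverse_index_sketch([1, 0, 2], n_best=2) == [2, 3, 0, 1, 4, 5]
assert expand_reverse_index_sketch([1, 0, 2], n_best=3) == [3, 4, 5, 0, 1, 2, 6, 7, 8]
```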
- joeynmt.helpers.flatten(array: List[List[Any]]) List[Any] [source]#
Flatten a nested 2D list. This is faster, even for a very long array, than [item for subarray in array for item in subarray] or repeated extend() calls.
- Parameters:
array – a nested list
- Returns:
flattened list
- joeynmt.helpers.freeze_params(module: Module) None [source]#
Freeze the parameters of this module, i.e. do not update them during training
- Parameters:
module – freeze parameters of this module
- joeynmt.helpers.get_latest_checkpoint(ckpt_dir: Path) Path | None [source]#
Returns the latest checkpoint (by creation time, not by step number!) from the given directory. If there is no checkpoint in this directory, returns None.
- Parameters:
ckpt_dir –
- Returns:
latest checkpoint file
- joeynmt.helpers.load_checkpoint(path: Path, map_location: device | Dict) Dict [source]#
Load model from saved checkpoint.
- Parameters:
path – path to checkpoint
map_location – cuda device name or cpu
- Returns:
checkpoint (dict)
- joeynmt.helpers.make_model_dir(model_dir: Path, overwrite: bool = False) None [source]#
Create a new directory for the model.
- Parameters:
model_dir – path to model directory
overwrite – whether to overwrite an existing directory
- joeynmt.helpers.read_list_from_file(input_path: Path) List[str] [source]#
Read list of str from file in input_path.
- Parameters:
input_path – input file path
- Returns:
list of strings
- joeynmt.helpers.remove_extra_spaces(s: str) str [source]#
Remove extra spaces - used in pre_process() / post_process() in tokenizer.py
- Parameters:
s – input string
- Returns:
string w/o extra white spaces
- joeynmt.helpers.resolve_ckpt_path(load_model: Path, model_dir: Path) Path [source]#
Get checkpoint path. If load_model is not specified, take the best or latest checkpoint from the model dir.
- Parameters:
load_model – Path(cfg[‘training’][‘load_model’]) or Path(cfg[‘testing’][‘load_model’])
model_dir – Path(cfg[‘model_dir’])
- Returns:
resolved checkpoint path
- joeynmt.helpers.save_hypothese(output_path: Path, hypotheses: List[str], n_best: int = 1) None [source]#
Save hypotheses list to file.
- Parameters:
output_path – output file path
hypotheses – hypotheses to write
n_best – n_best size
- joeynmt.helpers.set_seed(seed: int) None [source]#
Set the random seed for modules torch, numpy and random.
- Parameters:
seed – random seed
- joeynmt.helpers.store_attention_plots(attentions: ndarray, targets: List[List[str]], sources: List[List[str]], output_prefix: str, indices: List[int], tb_writer: SummaryWriter | None = None, steps: int = 0) None [source]#
Saves attention plots.
- Parameters:
attentions – attention scores
targets – list of tokenized targets
sources – list of tokenized sources
output_prefix – prefix for attention plots
indices – indices selected for plotting
tb_writer – Tensorboard summary writer (optional)
steps – current training steps, needed for tb_writer
dpi – resolution for images
- joeynmt.helpers.subsequent_mask(size: int) Tensor [source]#
Mask out subsequent positions (to prevent attending to future positions). Transformer helper function.
- Parameters:
size – size of mask (2nd and 3rd dim)
- Returns:
Tensor with 0s and 1s of shape (1, size, size)
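A sketch of the mask construction (1 = may attend, 0 = masked out):

```python
import torch

def subsequent_mask_sketch(size: int) -> torch.Tensor:
    # lower-triangular matrix: position i may attend to positions <= i
    return torch.tril(torch.ones(1, size, size, dtype=torch.long))

print(subsequent_mask_sketch(3))
# tensor([[[1, 0, 0],
#          [1, 1, 0],
#          [1, 1, 1]]])
```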
- joeynmt.helpers.symlink_update(target: Path, link_name: Path) Path | None [source]#
This function finds the file that the symlink currently points to, sets it to the new target, and returns the previous target if it exists.
- Parameters:
target – A path to a file that we want the symlink to point to. no parent dir, filename only, i.e. “10000.ckpt”
link_name – This is the name of the symlink that we want to update. link name with parent dir, i.e. “models/my_model/best.ckpt”
- Returns:
- current_last: This is the previous target of the symlink, before it is
updated in this function. If the symlink did not exist before or did not have a target, None is returned instead.
- joeynmt.helpers.tile(x: Tensor, count: int, dim=0) Tensor [source]#
Tiles x on dimension dim count times. From OpenNMT. Used for beam search.
- Parameters:
x – tensor to tile
count – number of tiles
dim – dimension along which the tensor is tiled
- Returns:
tiled tensor
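For example (a sketch; in beam search this turns every batch entry into beam_size adjacent copies):

```python
import torch
from joeynmt.helpers import tile

x = torch.tensor([[1, 2], [3, 4]])  # (batch_size=2, seq_len=2)
tiled = tile(x, count=3, dim=0)     # each row repeated 3 times
print(tiled.shape)                  # torch.Size([6, 2])
```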
joeynmt.initialization module#
Implements custom initialization
- joeynmt.initialization.compute_alpha_beta(num_enc_layers: int, num_dec_layers: int) Dict[str, Dict] [source]#
DeepNet: compute alpha/beta value suggested in https://arxiv.org/abs/2203.00555
- joeynmt.initialization.initialize_model(model: Module, cfg: dict, src_padding_idx: int, trg_padding_idx: int) None [source]#
This initializes a model based on the provided config.
All initializer configuration is part of the model section of the configuration file. For an example, see e.g. https://github.com/joeynmt/joeynmt/blob/main/configs/iwslt14_ende_spm.yaml.
The main initializer is set using the initializer key. Possible values are xavier, uniform, normal or zeros. (xavier is the default).
When an initializer is set to uniform, then init_weight sets the range for the values (-init_weight, init_weight).
When an initializer is set to normal, then init_weight sets the standard deviation for the weights (with mean 0).
The word embedding initializer is set using embed_initializer and takes the same values. The default is normal with embed_init_weight = 0.01.
Biases are initialized separately using bias_initializer. The default is zeros, but you can use the same initializers as the main initializer.
Set init_rnn_orthogonal to True if you want RNN orthogonal initialization (for recurrent matrices). Default is False.
lstm_forget_gate controls how the LSTM forget gate is initialized. Default is 1.
- Parameters:
model – model to initialize
cfg – the model configuration
src_padding_idx – index of source padding token
trg_padding_idx – index of target padding token
- joeynmt.initialization.lstm_forget_gate_init_(cell: RNNBase, value: float = 1.0) None [source]#
Initialize LSTM forget gates with value.
- Parameters:
cell – LSTM cell
value – initial value, default: 1
- joeynmt.initialization.orthogonal_rnn_init_(cell: RNNBase, gain: float = 1.0) None [source]#
Orthogonal initialization of recurrent weights. RNN parameters contain 3 or 4 matrices in one parameter, so we slice them.
- joeynmt.initialization.xavier_uniform_n_(w: Tensor, gain: float = 1.0, n: int = 4) None [source]#
Xavier initializer for parameters that combine multiple matrices in one parameter for efficiency. This is e.g. used for GRU and LSTM parameters, where e.g. all gates are computed at the same time by 1 big matrix.
- Parameters:
w – parameter
gain – default 1
n – default 4
joeynmt.metrics module#
Evaluation metrics
- joeynmt.metrics.bleu(hypotheses: List[str], references: List[str], **sacrebleu_cfg) float [source]#
Raw corpus BLEU from sacrebleu (without tokenization) cf. https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/metrics/bleu.py
- Parameters:
hypotheses – list of hypotheses (strings)
references – list of references (strings)
- Returns:
bleu score
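Roughly what this wrapper delegates to in sacrebleu (a sketch; the wrapper may pass additional **sacrebleu_cfg options through):

```python
from sacrebleu.metrics import BLEU

hypotheses = ["the cat sat on the mat"]
references = ["the cat sat on a mat"]

# corpus_score expects a list of reference streams, hence the extra list
score = BLEU().corpus_score(hypotheses, [references]).score
print(round(score, 1))  # corpus BLEU on the 0-100 scale
```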
- joeynmt.metrics.chrf(hypotheses: List[str], references: List[str], **sacrebleu_cfg) float [source]#
Character F-score from sacrebleu cf. https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/metrics/chrf.py
- Parameters:
hypotheses – list of hypotheses (strings)
references – list of references (strings)
- Returns:
character f-score (0 <= chrf <= 1); see the breaking change in sacrebleu v2.0
- joeynmt.metrics.sequence_accuracy(hypotheses: List[str], references: List[str]) float [source]#
Compute the accuracy of hypothesis tokens: correct tokens / all tokens. Tokens are correct if they appear in the same position in the reference. We look up the references before one-hot encoding; that is, hypotheses containing UNK are always evaluated as incorrect.
- Parameters:
hypotheses – list of hypotheses (strings)
references – list of references (strings)
- Returns:
- joeynmt.metrics.token_accuracy(hypotheses: List[str], references: List[str], tokenizer: Callable) float [source]#
Compute the accuracy of hypothesis tokens: correct tokens / all tokens. Tokens are correct if they appear in the same position in the reference. We look up the references before one-hot encoding; that is, UNK generation in hypotheses is always evaluated as incorrect.
- Parameters:
hypotheses – list of hypotheses (strings)
references – list of references (strings)
- Returns:
token accuracy (float)
joeynmt.model module#
Module to represent whole models
- class joeynmt.model.DataParallelWrapper(module: Module)[source]#
Bases:
Module
DataParallel wrapper to pass through the model attributes
- ex. 1) for DataParallel
>>> from torch.nn import DataParallel as DP
>>> model = DataParallelWrapper(DP(model))
- ex. 2) for DistributedDataParallel
>>> from torch.nn.parallel import DistributedDataParallel as DDP
>>> model = DataParallelWrapper(DDP(model))
- forward(*args, **kwargs)[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class joeynmt.model.Model(encoder: Encoder, decoder: Decoder, src_embed: Embeddings, trg_embed: Embeddings, src_vocab: Vocabulary, trg_vocab: Vocabulary)[source]#
Bases:
Module
Base Model class
- forward(return_type: str = None, **kwargs) Tuple[Tensor, Tensor, Tensor, Tensor] [source]#
Interface for multi-gpu
For DataParallel, we need to encapsulate all model calls (model.encode(), model.decode(), and model.encode_decode()) by model.__call__(). model.__call__() triggers model.forward() together with pre hooks and post hooks, which take care of multi-gpu distribution.
- Parameters:
return_type – one of {“loss”, “encode”, “decode”}
- property loss_function#
- joeynmt.model.build_model(cfg: Dict = None, src_vocab: Vocabulary = None, trg_vocab: Vocabulary = None) Model [source]#
Build and initialize the model according to the configuration.
- Parameters:
cfg – dictionary configuration containing model specifications
src_vocab – source vocabulary
trg_vocab – target vocabulary
- Returns:
built and initialized model
joeynmt.plotting module#
Plot attentions
- joeynmt.plotting.plot_heatmap(scores: ndarray, column_labels: List[str], row_labels: List[str], output_path: str | None = None, dpi: int = 300) Figure [source]#
Plotting function that can be used to visualize (self-)attention. Plots are saved if output_path is specified, in the format given by the file extension (‘pdf’ or ‘png’).
- Parameters:
scores – attention scores
column_labels – labels for columns (e.g. target tokens)
row_labels – labels for rows (e.g. source tokens)
output_path – path to save to
dpi – set resolution for matplotlib
- Returns:
pyplot figure
joeynmt.prediction module#
This module holds methods for generating predictions from a model.
- joeynmt.prediction.evaluate(valid_scores: Dict, valid_hyp: List, data: Dataset, args: TestConfig) Tuple[Dict[str, float], List[str]] [source]#
Compute evaluation metrics
- Parameters:
valid_scores – scores dict
valid_hyp – decoded hypotheses
data – eval Dataset
args – configuration args
- Returns:
valid_scores: evaluation scores
valid_ref: postprocessed references
- joeynmt.prediction.predict(model: Model, data: Dataset, device: device, n_gpu: int, rank: int = 0, compute_loss: bool = False, normalization: str = 'batch', num_workers: int = 0, args: TestConfig = None, autocast: Dict = None) Tuple[Dict[str, float], List[str] | None, List[str] | None, List[List[str]], List[ndarray], List[ndarray]] [source]#
Generates translations for the given data. If compute_loss is True and references are given, also computes the loss.
- Parameters:
model – model module
data – dataset for validation
device – torch device
n_gpu – number of GPUs
rank – ddp rank
compute_loss – whether to compute a scalar loss for given inputs and targets
normalization – one of {batch, tokens, none}
num_workers – number of workers for collate_fn() in data iterator
args – configuration args
autocast – autocast context
- Returns:
valid_scores: (dict) current validation scores,
valid_ref: (list of str) post-processed validation references,
valid_hyp: (list of str) post-processed validation hypotheses,
decoded_valid: (list of list of str) token-level validation hypotheses,
valid_seq_scores: (list of np.array) log probabilities (hyp or ref)
valid_attn_scores: (list of np.array) attention scores (hyp or ref)
- joeynmt.prediction.prepare(args: BaseConfig, rank: int, mode: str) Tuple[Model, Dataset, Dataset, Dataset] [source]#
Helper function for model and data loading.
- Parameters:
args – config args
rank – ddp rank
mode – execution mode
- joeynmt.prediction.test(cfg: Dict, output_path: str = None, prepared: Dict = None, save_attention: bool = False, save_scores: bool = False) None [source]#
Main test function. Handles loading a model from checkpoint, generating translations, storing them, and plotting attention.
- Parameters:
cfg – configuration dict
output_path – path to output
prepared – model and datasets passed from training
save_attention – whether to save attention visualizations
save_scores – whether to save scores
- joeynmt.prediction.translate(cfg: Dict, output_path: str = None) None [source]#
Interactive translation function. Loads model from checkpoint and translates either the stdin input or asks for input to translate interactively. Translations and scores are printed to stdout. Note: The input sentences don’t have to be pre-tokenized.
- Parameters:
cfg – configuration dict
output_path – path to output file
joeynmt.search module#
Search module
- joeynmt.search.beam_search(model: Model, beam_size: int, encoder_output: Tensor, encoder_hidden: Tensor, src_mask: Tensor, max_output_length: int, alpha: float, n_best: int = 1, **kwargs) Tuple[Tensor, Tensor, Tensor] [source]#
Beam search with size k. In each decoding step, find the k most likely partial hypotheses. Inspired by OpenNMT-py, adapted for Transformer.
- Parameters:
model –
beam_size – size of the beam
encoder_output –
encoder_hidden –
src_mask –
max_output_length –
alpha – alpha factor for length penalty
n_best – return this many hypotheses, <= beam (currently only 1)
- Returns:
stacked_output: output hypotheses (2d array of indices),
stacked_scores: scores (2d array of sequence-wise log probabilities),
stacked_attention_scores: attention scores (3d array)
- joeynmt.search.greedy(src_mask: Tensor, max_output_length: int, model: Model, encoder_output: Tensor, encoder_hidden: Tensor, **kwargs) Tuple[Tensor, Tensor, Tensor] [source]#
Greedy decoding. Select the token with the highest probability at each time step. This function is a wrapper that calls recurrent_greedy for recurrent decoders and transformer_greedy for transformer decoders.
- Parameters:
src_mask – mask for source inputs, 0 for positions after </s>
max_output_length – maximum length for the hypotheses
model – model to use for greedy decoding
encoder_output – encoder hidden states for attention
encoder_hidden – encoder last state for decoder initialization
- Returns:
stacked_output: output hypotheses (2d array of indices),
stacked_scores: scores (2d array of token-wise log probabilities),
stacked_attention_scores: attention scores (3d array)
- joeynmt.search.search(model: Model, batch: Batch, max_output_length: int, beam_size: int, beam_alpha: float, n_best: int = 1, **kwargs) Tuple[ndarray, ndarray, ndarray] [source]#
Get outputs and attention scores for a given batch.
- Parameters:
model – Model class
batch – batch to generate hypotheses for
max_output_length – maximum length of hypotheses
beam_size – size of the beam for beam search, if 0 use greedy
beam_alpha – alpha value for beam search
n_best – candidates to return
- Returns:
stacked_output: hypotheses for batch,
stacked_scores: log probabilities for batch,
stacked_attention_scores: attention scores for batch
joeynmt.tokenizers module#
Tokenizer module
- class joeynmt.tokenizers.BasicTokenizer(level: str = 'word', lowercase: bool = False, normalize: bool = False, max_length: int = -1, min_length: int = -1, **kwargs)[source]#
Bases:
object
- SPACE = ' '#
- SPACE_ESCAPE = '▁'#
- post_process(sequence: List[str] | str, generate_unk: bool = True, cut_at_sep: bool = True) str [source]#
Detokenize
- pre_process(raw_input: str, allow_empty: bool = False) str [source]#
Pre-process text, e.g. lowercase, normalize, remove emojis, pre-tokenize (add extra white space before punctuation) etc. Applied to all inputs, both in training and in inference.
- Parameters:
raw_input – raw input string
allow_empty – whether to allow empty string
- Returns:
preprocessed input string
- class joeynmt.tokenizers.SentencePieceTokenizer(level: str = 'bpe', lowercase: bool = False, normalize: bool = False, max_length: int = -1, min_length: int = -1, **kwargs)[source]#
Bases:
BasicTokenizer
- class joeynmt.tokenizers.SubwordNMTTokenizer(level: str = 'bpe', lowercase: bool = False, normalize: bool = False, max_length: int = -1, min_length: int = -1, **kwargs)[source]#
Bases:
BasicTokenizer
- joeynmt.tokenizers.build_tokenizer(cfg: Dict) Dict[str, BasicTokenizer] [source]#
joeynmt.training module#
Training module
- class joeynmt.training.TrainManager(rank: int, model: Model, model_dir: Path, device: device, n_gpu: int = 0, num_workers: int = 0, autocast: Dict = None, seed: int = 42, train_args: TrainConfig = None, dev_args: TestConfig = None)[source]#
Bases:
object
Manages training loop, validations, learning rate scheduling and early stopping.
- class TrainStatistics(minimize_metric: bool = True)[source]#
Bases:
object
Train Statistics
- Parameters:
epochs – epoch counter
steps – global update step counter
is_min_lr – stop by reaching learning rate minimum
is_max_update – stop by reaching max num of updates
total_tokens – number of total tokens seen so far
best_ckpt_iter – store iteration point of best ckpt
minimize_metric – minimize or maximize score
total_correct – number of correct tokens seen so far
- init_from_checkpoint(path: Path, reset_best_ckpt: bool = False, reset_scheduler: bool = False, reset_optimizer: bool = False, reset_iter_state: bool = False) None [source]#
Initialize the trainer from a given checkpoint file.
This checkpoint file contains not only model parameters, but also scheduler and optimizer states, see self._save_checkpoint.
- Parameters:
path – path to checkpoint
reset_best_ckpt – reset tracking of the best checkpoint, use for domain adaptation with a new dev set.
reset_scheduler – reset the learning rate scheduler, and do not use the one stored in the checkpoint.
reset_optimizer – reset the optimizer, and do not use the one stored in the checkpoint.
reset_iter_state – reset the sampler’s internal state and do not use the one stored in the checkpoint.
- joeynmt.training.train(rank: int, world_size: int, cfg: Dict, skip_test: bool = False) None [source]#
Main training function. After training, also test on test data if given.
- Parameters:
rank – ddp local rank
world_size – ddp world size
cfg – configuration dict
skip_test – whether a test should be run or not after training
joeynmt.vocabulary module#
Vocabulary module
- class joeynmt.vocabulary.Vocabulary(tokens: List[str], cfg: SimpleNamespace)[source]#
Bases:
object
Vocabulary represents a mapping between tokens and indices.
- add_tokens(tokens: List[str]) None [source]#
Add list of tokens to vocabulary
- Parameters:
tokens – list of tokens to add to the vocabulary
- arrays_to_sentences(arrays: ndarray, cut_at_eos: bool = True, skip_pad: bool = True) List[List[str]] [source]#
Convert multiple arrays containing sequences of token IDs to their sentences, optionally cutting them off at the end-of-sequence token.
- Parameters:
arrays – 2D array containing indices
cut_at_eos – cut the decoded sentences at the first <eos>
skip_pad – skip generated <pad> tokens
- Returns:
list of list of strings (tokens)
- is_unk(token: str) bool [source]#
Check whether a token is covered by the vocabulary
- Parameters:
token –
- Returns:
True if covered, False otherwise
- lookup(token: str) int [source]#
Look up the encoding dictionary (needed for multiprocessing).
- Parameters:
token – surface str
- Returns:
token id
- sentences_to_ids(sentences: List[List[str]], bos: bool = True, eos: bool = True) Tuple[List[List[int]], List[int], List[int]] [source]#
Encode sentences to indices and pad sequences to the maximum length of the sentences given
- Parameters:
sentences – list of tokenized sentences
bos – whether to add <bos>
eos – whether to add <eos>
- Returns:
padded ids
original lengths before padding
prompt_mask
- joeynmt.vocabulary.build_vocab(cfg: Dict, dataset: BaseDataset = None, model_dir: Path = None) Tuple[Vocabulary, Vocabulary] [source]#
- joeynmt.vocabulary.sort_and_cut(counter: Counter, max_size: int = 9223372036854775807, min_freq: int = -1) List[str] [source]#
Cut counter to most frequent, sorted numerically and alphabetically.
- Parameters:
counter – flattened token list in a Counter object
max_size – maximum size of vocabulary
min_freq – minimum frequency for an item to be included
- Returns:
list of valid tokens
joeynmt.loss module#
Module to implement training loss
- class joeynmt.loss.XentLoss(pad_index: int, smoothing: float = 0.0)[source]#
Bases:
Module
Cross-Entropy Loss with optional label smoothing
- forward(log_probs: Tensor, **kwargs) Tensor [source]#
Compute the cross-entropy between logits and targets.
If label smoothing is used, target distributions are not one-hot, but “1-smoothing” for the correct target token and the rest of the probability mass is uniformly spread across the other tokens.
- Parameters:
log_probs – log probabilities as predicted by model
- Returns:
cross-entropy loss value
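A sketch of how such a smoothed target distribution is typically built (mirrors common label-smoothing implementations; spreading the mass over vocab_size - 2 tokens, excluding the gold token and <pad>, is an assumption about the exact bookkeeping):

```python
import torch

def smoothed_targets(targets: torch.Tensor, vocab_size: int,
                     smoothing: float = 0.1, pad_index: int = 1) -> torch.Tensor:
    # spread `smoothing` uniformly over all tokens except gold and <pad>
    dist = torch.full((targets.size(0), vocab_size),
                      smoothing / (vocab_size - 2))
    dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)  # gold token mass
    dist[:, pad_index] = 0.0          # never assign probability to <pad>
    dist[targets == pad_index] = 0.0  # zero out rows for padding positions
    return dist

targets = torch.tensor([4, 2, 1])  # last position is <pad>
print(smoothed_targets(targets, vocab_size=6))  # each non-pad row sums to 1
```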
joeynmt.transformer_layers module#
Transformer layers
- class joeynmt.transformer_layers.MultiHeadedAttention(num_heads: int, size: int, dropout: float = 0.1)[source]#
Bases:
Module
Multi-Head Attention module from “Attention is All You Need”
Implementation modified from OpenNMT-py. https://github.com/OpenNMT/OpenNMT-py
- forward(k: Tensor, v: Tensor, q: Tensor, mask: Tensor | None = None, return_weights: bool | None = None)[source]#
Computes multi-headed attention.
- Parameters:
k – keys [batch_size, seq_len, hidden_size]
v – values [batch_size, seq_len, hidden_size]
q – query [batch_size, seq_len, hidden_size]
mask – optional mask [batch_size, 1, seq_len]
return_weights – whether to return the attention weights, averaged over heads.
- Returns:
output [batch_size, query_len, hidden_size]
attention_weights [batch_size, query_len, key_len]
- class joeynmt.transformer_layers.PositionalEncoding(size: int = 0, max_len: int = 5000)[source]#
Bases:
Module
Pre-compute position encodings (PE). In forward pass, this adds the position-encodings to the input for as many time steps as necessary.
Implementation based on OpenNMT-py. https://github.com/OpenNMT/OpenNMT-py
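A sketch of the pre-computed sinusoidal table, in the standard formulation from the Transformer paper (assumes an even size):

```python
import math
import torch

def sinusoidal_pe(max_len: int, size: int) -> torch.Tensor:
    # pe[pos, 2i] = sin(pos / 10000^(2i/size)); pe[pos, 2i+1] = cos(...)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, size, 2, dtype=torch.float)
                         * -(math.log(10000.0) / size))
    pe = torch.zeros(max_len, size)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # added to the embeddings for the first max_len time steps

print(sinusoidal_pe(max_len=5000, size=512).shape)  # torch.Size([5000, 512])
```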
- class joeynmt.transformer_layers.PositionwiseFeedForward(input_size: int, ff_size: int, dropout: float = 0.1, alpha: float = 1.0, layer_norm: str = 'post', activation: str = 'relu')[source]#
Bases:
Module
Position-wise feed-forward layer. Projects to ff_size and then back down to input_size.
- forward(x: Tensor) Tensor [source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class joeynmt.transformer_layers.TransformerDecoderLayer(size: int = 0, ff_size: int = 0, num_heads: int = 0, dropout: float = 0.1, alpha: float = 1.0, layer_norm: str = 'post', activation: str = 'relu')[source]#
Bases:
Module
Transformer decoder layer.
Consists of self-attention, source-attention, and feed-forward.
- forward(x: Tensor, memory: Tensor, src_mask: Tensor, trg_mask: Tensor, return_attention: bool = False, **kwargs) Tensor [source]#
Forward pass of a single Transformer decoder layer.
First applies target-target self-attention, dropout with residual connection (adding the input to the result), and layer norm.
Second computes source-target cross-attention, dropout with residual connection (adding the self-attention to the result), and layer norm.
Finally goes through a position-wise feed-forward layer.
- Parameters:
x – inputs
memory – source representations
src_mask – source mask
trg_mask – target mask (so as not to condition on future steps)
return_attention – whether to return the attention weights
- Returns:
output tensor
attention weights
- class joeynmt.transformer_layers.TransformerEncoderLayer(size: int = 0, ff_size: int = 0, num_heads: int = 0, dropout: float = 0.1, alpha: float = 1.0, layer_norm: str = 'post', activation: str = 'relu')[source]#
Bases:
Module
One Transformer encoder layer has a Multi-head attention layer plus a position-wise feed-forward layer.
- forward(x: Tensor, mask: Tensor) Tensor [source]#
Forward pass for a single transformer encoder layer. First applies self attention, then dropout with residual connection (adding the input to the result), then layer norm, and then a position-wise feed-forward layer.
- Parameters:
x – layer input
mask – input mask
- Returns:
output tensor