Make preprocessor


There are 6 kinds of preprocessors in chariot.

Each preprocessor is defined as a scikit-learn Transformer. Because of this, the preprocessors are located in chariot.transformer.

You can initialize the parameters of a preprocessor with fit, and apply preprocessing with transform.
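
The fit/transform protocol mentioned above can be sketched in plain Python. The LowercaseNormalizer below is a hypothetical example (not part of chariot) that just shows the shape of a scikit-learn-style Transformer:

```python
# A minimal sketch of the scikit-learn Transformer protocol that
# chariot's preprocessors follow. LowercaseNormalizer is a hypothetical
# example, not a chariot class.
class LowercaseNormalizer:

    def fit(self, X, y=None):
        # This simple normalizer has no parameters to learn,
        # so fit just returns self (as the protocol requires).
        return self

    def transform(self, X):
        # Apply the preprocessing to each element.
        return [x.lower() for x in X]


normalizer = LowercaseNormalizer()
normalizer.fit(["Some Text"])
print(normalizer.transform(["Hey! You"]))  # ['hey! you']
```

Because every preprocessor shares this interface, they can later be chained together into a pipeline.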

Text preprocessor

The role of the Text preprocessor is to arrange text before tokenization.

import chariot.transformer as ct


text = "Hey! you preprocess text now :)"
preprocessed = ct.text.SymbolFilter().transform(text)
> Hey you preprocess text now
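
What a symbol filter does can be sketched with the standard library's re module. This mimics the behavior of ct.text.SymbolFilter shown above; it is an illustration, not chariot's actual implementation:

```python
import re


# A plain-Python sketch of symbol filtering: strip ASCII punctuation
# symbols from text, then collapse any leftover runs of whitespace.
def filter_symbols(text):
    stripped = re.sub(r"[!-/:-@\[-`{-~]", "", text)
    return re.sub(r"\s+", " ", stripped).strip()


print(filter_symbols("Hey! you preprocess text now :)"))
# Hey you preprocess text now
```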

Tokenizer

The role of the Tokenizer is to tokenize text.

import chariot.transformer as ct


text = "Hey you preprocess text now"
tokens = ct.Tokenizer(lang="en").transform(text)
> [<Hey:INTJ>, <you:PRON>, <preprocess:ADJ>, <text:NOUN>, <now:ADV>]

When tokenizing text, chariot mainly uses spaCy. You can specify the language with the lang parameter. But if you want to tokenize Japanese text, you have to install Janome or MeCab, since spaCy does not support Japanese well.

Token preprocessor

The role of the Token preprocessor is to filter/normalize tokens before building the vocabulary.

import chariot.transformer as ct


text = "Hey you preprocess text now"
tokens = ct.Tokenizer(lang="en").transform(text)
filtered = ct.token.StopwordFilter(lang="en").transform(tokens)
> [<Hey:INTJ>, <preprocess:ADJ>, <text:NOUN>]
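
Stopword filtering itself is simple to sketch in plain Python. This mimics ct.token.StopwordFilter; the tiny stopword set here is illustrative, not chariot's actual list:

```python
# A hypothetical, illustrative stopword set (chariot ships its own lists).
STOPWORDS = {"you", "now", "the", "a", "an", "is"}


# Keep only tokens that are not stopwords (case-insensitive check).
def filter_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]


print(filter_stopwords(["Hey", "you", "preprocess", "text", "now"]))
# ['Hey', 'preprocess', 'text']
```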

Vocabulary

The role of the Vocabulary is to convert words to vocabulary indices.

import chariot.transformer as ct


vocab = ct.Vocabulary()

doc = [
    ["you", "are", "reading", "the", "book"],
    ["I", "don't", "know", "its", "title"],
]

vocab.fit(doc)
text = ["you", "know", "book", "title"]
indexed = vocab.transform(text)
inversed = vocab.inverse_transform(indexed)
> [4, 11, 8, 13]
> ['you', 'know', 'book', 'title']

You can specify reserved words (for padding, unknown words, etc.) and set parameters to limit the vocabulary size, like the following.

vocab = Vocabulary(padding="_pad_", unknown="_unk_", min_df=1)
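
How a vocabulary with reserved words could work can be sketched in plain Python. This mimics chariot's Vocabulary interface; the implementation details (index order, reserved-word placement) are illustrative assumptions, not chariot's actual behavior:

```python
from collections import Counter


# A minimal vocabulary sketch: reserved words occupy the first indices,
# words seen fewer than min_df times are dropped, and unseen words map
# to the unknown index.
class SimpleVocabulary:

    def __init__(self, padding="_pad_", unknown="_unk_", min_df=1):
        self.padding = padding
        self.unknown = unknown
        self.min_df = min_df
        self._index = {}

    def fit(self, docs):
        counts = Counter(w for doc in docs for w in doc)
        words = sorted(w for w, c in counts.items() if c >= self.min_df)
        # Reserved words come first, so padding is always index 0.
        vocab = [self.padding, self.unknown] + words
        self._index = {w: i for i, w in enumerate(vocab)}
        return self

    def transform(self, words):
        unk = self._index[self.unknown]
        return [self._index.get(w, unk) for w in words]


vocab = SimpleVocabulary(min_df=1)
vocab.fit([["you", "are", "reading"], ["I", "know", "you"]])
print(vocab.transform(["you", "know", "xyz"]))  # "xyz" maps to the unknown index, 1
```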

Formatter

The role of the Formatter is to adjust data for your model.

import chariot.transformer as ct


formatter = ct.formatter.Padding(padding=0, length=5)

data = [
    [1, 2],
    [3, 4, 5],
    [1, 2, 3, 4, 5]
]

padded = formatter.transform(data)
> [[1 2 0 0 0]
   [3 4 5 0 0]
   [1 2 3 4 5]]
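
Fixed-length padding is easy to sketch in plain Python. This mimics ct.formatter.Padding with padding=0 and length=5 (it returns plain lists, whereas chariot may return an array):

```python
# Pad each sequence to a fixed length with the padding value,
# truncating sequences that are already longer.
def pad_sequences(data, padding=0, length=5):
    return [(seq + [padding] * length)[:length] for seq in data]


data = [[1, 2], [3, 4, 5], [1, 2, 3, 4, 5]]
print(pad_sequences(data))
# [[1, 2, 0, 0, 0], [3, 4, 5, 0, 0], [1, 2, 3, 4, 5]]
```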

Generator

The role of the Generator is to generate the target/source data for your model.
For example, when you train a language model, the target data is the source data shifted by one position.

import chariot.transformer as ct


generator = ct.generator.ShiftedTarget(shift=1)
source, target = generator.generate([1, 2, 3, 4, 5], index=0, length=3)
> source
[1, 2, 3]
> target
[2, 3, 4]
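
The shifted source/target logic can be sketched as a few lines of slicing. This mimics ct.generator.ShiftedTarget(shift=1) as used above; the function form is an illustrative assumption:

```python
# Take a window of `length` items starting at `index` as the source,
# and the same window moved forward by `shift` as the target.
def shifted_target(sequence, index=0, length=3, shift=1):
    source = sequence[index:index + length]
    target = sequence[index + shift:index + shift + length]
    return source, target


source, target = shifted_target([1, 2, 3, 4, 5])
print(source)  # [1, 2, 3]
print(target)  # [2, 3, 4]
```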

Now you have learned the role of each preprocessor. Next, let's make a preprocessing pipeline by composing them.