Make a preprocessor pipeline


You can combine preprocessors to build a pipeline. As the name indicates, it works just like the scikit-learn Pipeline.
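
For comparison, a plain scikit-learn Pipeline is built and used the same way. This sketch uses scikit-learn's own text transformers and is not part of chariot:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline


# Each step transforms the output of the previous one, just like chariot's stacked preprocessors.
pipe = Pipeline([
    ("count", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])
pipe.fit(["some training text", "another training text"])
features = pipe.transform(["a new document"])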

Define a Pipeline

You can use Preprocessor to combine multiple preprocessors.

import chariot.transformer as ct
from chariot.preprocessor import Preprocessor


preprocessor = Preprocessor()
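# Stack transformers in order: normalize text -> tokenize -> drop stopwords -> map tokens to vocabulary ids.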
preprocessor\
    .stack(ct.text.UnicodeNormalizer())\
    .stack(ct.Tokenizer("en"))\
    .stack(ct.token.StopwordFilter("en"))\
    .stack(ct.Vocabulary(min_df=5, max_df=0.5))\
    .fit(train_data)

preprocessed = preprocessor.transform(data)
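
train_data and data are not defined in the snippet above. As a hypothetical sketch, they can simply be lists of raw text strings:

# Hypothetical toy data, only to show the expected shape of the input.
train_data = ["This movie was great!", "This movie was terrible."]
data = ["A new review to preprocess."]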

You can save & load the Preprocessor.

preprocessor.save("my_preprocessor.pkl")

loaded = Preprocessor.load("my_preprocessor.pkl")

This means you can package and carry the preprocessing as a .pkl file.
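
For example, another script (or another machine) only needs the saved file to reproduce the same preprocessing. A minimal sketch, reusing the my_preprocessor.pkl saved above:

from chariot.preprocessor import Preprocessor


# Load the fitted pipeline and apply it to new text.
preprocessor = Preprocessor.load("my_preprocessor.pkl")
preprocessed = preprocessor.transform(["A new review to preprocess."])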

Make a pipeline for a dataset

When you want to apply a distinct preprocess to each column of a dataset, you can use DatasetPreprocessor.

from chariot.dataset_preprocessor import DatasetPreprocessor
from chariot.transformer.formatter import Padding


dp = DatasetPreprocessor()
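# Column-wise pipelines: "review" is normalized, tokenized, filtered, indexed and padded;
# "polarity" is turned into a categorical label.
# pad_length is assumed to be defined elsewhere (e.g. the maximum sequence length).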
dp.process("review")\
    .by(ct.text.UnicodeNormalizer())\
    .by(ct.Tokenizer("en"))\
    .by(ct.token.StopwordFilter("en"))\
    .by(ct.Vocabulary(min_df=5, max_df=0.5))\
    .by(Padding(length=pad_length))\
    .fit(train_data["review"])
dp.process("polarity")\
    .by(ct.formatter.CategoricalLabel(num_class=3))


preprocessed = dp.preprocess(data)

You can save & load a DatasetPreprocessor just like a Preprocessor.

dp.save("my_dataset_preprocessor.tar.gz")
loaded = DatasetPreprocessor.load("my_dataset_preprocessor.tar.gz")
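
The loaded DatasetPreprocessor can be applied the same way as the original one:

preprocessed = loaded.preprocess(data)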

Why do you preprocess the data? Of course, because you want to train your model!
Next, feed the data to your model.