Tokenizers

  • Import from the paradedb.tokenizers module.

Available Tokenizers

The following tokenizer classes are available.

Available Stemmer Languages

Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish
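
For example, to stem English tokens after whitespace splitting, you can pass a stemmer language to a tokenizer. A minimal sketch, assuming the stemmer parameter accepts one of the language names above, spelled as shown:

from paradedb.tokenizers import WhitespaceTokenizer

# Apply English stemming to each whitespace-delimited token.
tokenizer = WhitespaceTokenizer(stemmer="English")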

Available Stopword Languages

Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish
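
Stopword filtering can be driven by a built-in language list, a custom list, or both. A minimal sketch, assuming stopwords_language accepts one of the names above and stopwords supplies additional custom terms:

from paradedb.tokenizers import WhitespaceTokenizer

# Filter built-in English stopwords, plus two domain-specific terms.
tokenizer = WhitespaceTokenizer(
    stopwords_language="English",
    stopwords=["foo", "bar"],
)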


WhitespaceTokenizer

WhitespaceTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)

See the ParadeDB whitespace tokenizer documentation.
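
A sketch of a typical configuration, assuming remove_long is a maximum token length (longer tokens are dropped) and ascii_folding transliterates accented characters to their ASCII equivalents:

from paradedb.tokenizers import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer(
    lowercase=True,      # normalize case before matching
    remove_long=255,     # drop very long tokens (assumed length limit)
    ascii_folding=True,  # e.g. "café" -> "cafe"
)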


RawTokenizer

RawTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)

See the ParadeDB raw tokenizer documentation.


KeyWordTokenizer

KeyWordTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)

See the ParadeDB keyword tokenizer documentation.


SourceCodeTokenizer

SourceCodeTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)

See the ParadeDB source code tokenizer documentation.
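
This tokenizer targets identifiers in source code: per the ParadeDB documentation it splits on casing conventions such as camelCase and PascalCase, so a sketch like the following would index the parts of an identifier as separate words:

from paradedb.tokenizers import SourceCodeTokenizer

# Expected behavior (per the source code tokenizer, hedged):
# "getUserName" -> "get", "user", "name" once lowercased.
tokenizer = SourceCodeTokenizer(lowercase=True)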


ChineseCompatibleTokenizer

ChineseCompatibleTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)

See the ParadeDB Chinese compatible tokenizer documentation.


LinderaTokenizer

LinderaTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)

See the ParadeDB Lindera tokenizer documentation.


JiebaTokenizer

JiebaTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)

See the ParadeDB Jieba tokenizer documentation.


ICUTokenizer

ICUTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)

See the ParadeDB ICU tokenizer documentation.


RegexTokenizer

RegexTokenizer(
    pattern: str,
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)

See the ParadeDB regex tokenizer documentation.
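
pattern is required. A minimal sketch, assuming each match of the pattern becomes a token (so runs of word characters are indexed and everything else is discarded):

from paradedb.tokenizers import RegexTokenizer

# Every match of \w+ is emitted as one token.
tokenizer = RegexTokenizer(pattern=r"\w+")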


NGramTokenizer

NGramTokenizer(
    min_gram: int,
    max_gram: int,
    prefix_only: bool = False,
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    ascii_folding: typing.Optional[bool] = None,
)

See the ParadeDB ngram tokenizer documentation.
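
min_gram and max_gram bound the n-gram lengths, and prefix_only=True emits only n-grams anchored at the start of the input, which suits prefix and autocomplete matching. For example:

from paradedb.tokenizers import NGramTokenizer

# "search" -> "se", "sea", "ea", "ear", ... with min_gram=2, max_gram=3;
# with prefix_only=True only the anchored grams "se" and "sea" are emitted.
tokenizer = NGramTokenizer(min_gram=2, max_gram=3, prefix_only=False)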


Example: Using Tokenizers with a BM25 Index

The example below assigns a different tokenizer to each indexed field:

from paradedb.indexes import Bm25Index, IndexFieldConfig, TextFieldIndexConfig, JSONFieldIndexConfig
from paradedb.tokenizers import WhitespaceTokenizer, NGramTokenizer

Bm25Index(
    fields=["id", "title", "description", "metadata"],
    name="bm25_idx",
    fields_config=IndexFieldConfig(
        text_fields=[
            # Whole-word matching on titles: split on whitespace only.
            TextFieldIndexConfig(
                field="title",
                fast=True,
                tokenizer=WhitespaceTokenizer(),
                normalizer="lowercase",
                record="position"
            ),
            # Substring matching on descriptions: index 2- and 3-grams.
            TextFieldIndexConfig(
                field="description",
                fast=True,
                tokenizer=NGramTokenizer(min_gram=2, max_gram=3),
                normalizer="lowercase",
                record="position"
            )
        ],
        json_fields=[
            # JSON fields take a tokenizer as well; expand_dots=True
            # treats dotted JSON keys as nested paths.
            JSONFieldIndexConfig(
                field="metadata",
                tokenizer=WhitespaceTokenizer(),
                expand_dots=True
            )
        ]
    )
)