Tokenizers
Import from the paradedb.tokenizers module.
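For example:

from paradedb.tokenizers import WhitespaceTokenizer, NGramTokenizer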
Available Tokenizers¶
Below are the available tokenizer classes.
Available Stemmer Languages¶
Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish, Polish
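A minimal sketch of selecting a stemmer by language name; the exact string casing is an assumption based on the list above.

from paradedb.tokenizers import WhitespaceTokenizer

# Stem tokens with the French stemmer (language-name casing assumed
# to match the list above).
tokenizer = WhitespaceTokenizer(stemmer="French")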
Available Stopword Languages¶
Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, Polish
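A sketch combining a built-in stopword list with custom stopwords; how the two parameters interact is an assumption.

from paradedb.tokenizers import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer(
    stopwords_language="English",   # built-in list, per the languages above
    stopwords=["lorem", "ipsum"],   # extra terms to drop (illustrative)
)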
WhitespaceTokenizer¶
WhitespaceTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)
See the ParadeDB whitespace tokenizer documentation.
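A minimal sketch of the optional filters shared by most tokenizers on this page; the parameter values are illustrative assumptions.

from paradedb.tokenizers import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer(
    lowercase=True,                # lowercase tokens before indexing
    stemmer="English",             # apply English stemming
    stopwords_language="English",  # drop English stopwords
    remove_long=40,                # drop tokens longer than 40 (limit illustrative)
    ascii_folding=True,            # fold accented characters to ASCII
)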
RawTokenizer¶
RawTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)
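If this wraps ParadeDB's raw tokenizer, it presumably emits the whole field value as a single token; a minimal sketch:

from paradedb.tokenizers import RawTokenizer

# Assuming raw emits the entire value as one token, the optional
# filters are typically left at their defaults.
tokenizer = RawTokenizer()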
KeyWordTokenizer¶
KeyWordTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)
See the ParadeDB keyword tokenizer documentation.
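If this maps to ParadeDB's keyword tokenizer, the field is treated as a single token, which suits exact matching on codes or tags; a sketch with an assumed option value:

from paradedb.tokenizers import KeyWordTokenizer

# Single-token behavior (assumed); lowercasing makes matching
# case-insensitive.
tokenizer = KeyWordTokenizer(lowercase=True)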
SourceCodeTokenizer¶
SourceCodeTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)
See the ParadeDB source code tokenizer documentation.
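ParadeDB's source code tokenizer is designed for identifiers such as camelCase and snake_case; a sketch with an assumed option value:

from paradedb.tokenizers import SourceCodeTokenizer

# Splits identifiers into their parts (e.g. "getUserId" -> "get",
# "user", "id", per the source code tokenizer's documented behavior);
# lowercasing is a common pairing.
tokenizer = SourceCodeTokenizer(lowercase=True)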
ChineseCompatibleTokenizer¶
ChineseCompatibleTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)
See the ParadeDB Chinese compatible tokenizer documentation.
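ParadeDB's Chinese compatible tokenizer splits CJK text character by character; stemming and language stopwords generally do not apply, so a sketch uses defaults:

from paradedb.tokenizers import ChineseCompatibleTokenizer

# One token per CJK character (assumed, per ParadeDB's
# chinese_compatible tokenizer).
tokenizer = ChineseCompatibleTokenizer()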
LinderaTokenizer¶
LinderaTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)
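Lindera is a dictionary-based morphological tokenizer for CJK text; since no dictionary option appears in the signature above, a sketch uses defaults:

from paradedb.tokenizers import LinderaTokenizer

# Dictionary-based CJK segmentation; which dictionary is used is not
# configurable through the parameters shown above.
tokenizer = LinderaTokenizer()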
JiebaTokenizer¶
JiebaTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)
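Jieba performs dictionary-based Chinese word segmentation; a minimal sketch:

from paradedb.tokenizers import JiebaTokenizer

# Segment Chinese text into words with Jieba; filters left at defaults.
tokenizer = JiebaTokenizer()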
ICUTokenizer¶
ICUTokenizer(
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)
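ICU segments text on Unicode word boundaries and handles many scripts; a sketch with an assumed option value:

from paradedb.tokenizers import ICUTokenizer

# Unicode-aware word segmentation; ascii_folding normalizes accented
# Latin characters (e.g. "café" -> "cafe").
tokenizer = ICUTokenizer(ascii_folding=True)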
RegexTokenizer¶
RegexTokenizer(
    pattern: str,
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    stopwords_language: typing.Optional[str] = None,
    stopwords: typing.Optional[typing.List[str]] = None,
    ascii_folding: typing.Optional[bool] = None,
)
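pattern is the only required argument; assuming each regex match becomes a token (as in Tantivy's regex tokenizer), a sketch with an illustrative pattern:

from paradedb.tokenizers import RegexTokenizer

# Keep alphanumeric runs as tokens and drop punctuation
# (pattern is illustrative).
tokenizer = RegexTokenizer(pattern=r"[A-Za-z0-9]+", lowercase=True)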
NGramTokenizer¶
NGramTokenizer(
    min_gram: int,
    max_gram: int,
    prefix_only: bool = False,
    stemmer: typing.Optional[str] = None,
    remove_long: typing.Optional[int] = None,
    lowercase: typing.Optional[bool] = None,
    ascii_folding: typing.Optional[bool] = None,
)
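min_gram and max_gram bound the n-gram lengths, and prefix_only restricts output to n-grams anchored at the start of the text; a sketch:

from paradedb.tokenizers import NGramTokenizer

# "search" with min_gram=2, max_gram=3 yields "se", "sea", "ea", "ear", ...
# prefix_only=True would keep only "se" and "sea" (useful for prefix search).
tokenizer = NGramTokenizer(min_gram=2, max_gram=3, prefix_only=False)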
Example: Using Tokenizers with a BM25 Index¶
from paradedb.indexes import Bm25Index, IndexFieldConfig, TextFieldIndexConfig, JSONFieldIndexConfig
from paradedb.tokenizers import WhitespaceTokenizer, NGramTokenizer
Bm25Index(
    fields=["id", "title", "description", "metadata"],
    name="bm25_idx",
    fields_config=IndexFieldConfig(
        text_fields=[
            TextFieldIndexConfig(
                field="title",
                fast=True,
                tokenizer=WhitespaceTokenizer(),
                normalizer="lowercase",
                record="position",
            ),
            TextFieldIndexConfig(
                field="description",
                fast=True,
                tokenizer=NGramTokenizer(min_gram=2, max_gram=3),
                normalizer="lowercase",
                record="position",
            ),
        ],
        json_fields=[
            JSONFieldIndexConfig(
                field="metadata",
                tokenizer=WhitespaceTokenizer(),
                expand_dots=True,
            )
        ],
    ),
)
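In this configuration, title is split on whitespace for whole-word matching, while description is indexed as 2-3 character n-grams to support partial and fuzzy matching; expand_dots=True presumably expands dotted keys in the metadata JSON into nested paths before tokenization.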