sumire.tokenizer package
Submodules
sumire.tokenizer.auto module
- class sumire.tokenizer.auto.AutoJapaneseTokenizer(path: str | None = None, *args, **kwargs)
Bases:
BaseTokenizer
AutoJapaneseTokenizer automatically selects the tokenizer based on the given path or uses MecabTokenizer if no path is provided.
- Parameters:
path (str, optional) – The directory path to a saved tokenizer configuration. If not provided, a simple MecabTokenizer is used.
Example
>>> tokenizer = AutoJapaneseTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens[0]
['これ', 'は', 'テスト', '文', 'です', '。']
- classmethod from_pretrained(path: str | Path) TokenizerType
Loads a tokenizer from a saved configuration.
- Parameters:
path (str or Path) – Directory path to the saved configuration.
- Returns:
Tokenizer instance initialized with the saved configuration.
- Return type:
TokenizerType
- save_pretrained(path: str | Path)
Saves tokenizer configuration to the specified path.
- Parameters:
path (str or Path) – Directory path to save the configuration.
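A hedged sketch of a save-and-reload round trip with save_pretrained and from_pretrained; the directory path below is illustrative, not one the library prescribes.
>>> from sumire.tokenizer.auto import AutoJapaneseTokenizer
>>> tokenizer = AutoJapaneseTokenizer()
>>> tokenizer.save_pretrained("/tmp/sumire_tokenizer")  # hypothetical path
>>> restored = AutoJapaneseTokenizer.from_pretrained("/tmp/sumire_tokenizer")
>>> tokens = restored.tokenize("これはテスト文です。")  # restored tokenizer behaves like the saved one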
- tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes input text.
- Parameters:
inputs (str or List[str]) – Text or list of texts to tokenize.
- Returns:
List of tokenized texts.
- Return type:
List[List[str]]
- sumire.tokenizer.auto.tokenizer_dict(name: str) Type[MecabTokenizer | JumanppTokenizer | SpacyGinzaTokenizer | SentencePieceTokenizer | SudachiTokenizer]
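No description accompanies tokenizer_dict; judging from its signature it maps a tokenizer name to the corresponding tokenizer class. A hedged sketch follows; the key string "mecab" is an assumed registry name, not one confirmed by this documentation.
>>> from sumire.tokenizer.auto import tokenizer_dict
>>> cls = tokenizer_dict("mecab")  # "mecab" is an assumed key; actual names depend on the library
>>> tokenizer = cls()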
sumire.tokenizer.base module
- class sumire.tokenizer.base.BaseTokenizer(*args, **kwargs)
Bases:
ABC
- fit(inputs: str | List[str]) None
Trains the tokenizer if training is required.
- classmethod from_pretrained(path: str | Path)
Loads a pretrained tokenizer from the specified path.
- Parameters:
path (str or Path) – The directory path where the pretrained tokenizer is saved.
- Returns:
An instance of the pretrained tokenizer.
- Raises:
NotImplementedError – This method must be implemented in derived classes.
- abstract save_pretrained(path: str | Path)
Saves the pretrained tokenizer to the specified path.
- Parameters:
path (str or Path) – The directory path where the pretrained tokenizer will be saved.
- Raises:
NotImplementedError – This method must be implemented in derived classes.
- abstract tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes the input text or list of texts.
- Parameters:
inputs (TokenizerInputs) – The input text or a list of texts to tokenize.
- Returns:
A list of lists, where each inner list represents tokenized words for a single input text.
- Return type:
TokenizerOutputs
- Raises:
NotImplementedError – This method must be implemented in derived classes.
- tokenizer_config_file = 'tokenizer_config.json'
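A minimal sketch of a custom subclass, assuming only what the abstract interface above requires; the whitespace-splitting logic and the no-op persistence are illustrative, not part of the library.
>>> from pathlib import Path
>>> from sumire.tokenizer.base import BaseTokenizer
>>> class WhitespaceTokenizer(BaseTokenizer):  # hypothetical subclass for illustration
...     def tokenize(self, inputs):
...         texts = [inputs] if isinstance(inputs, str) else inputs
...         return [text.split() for text in texts]
...     def save_pretrained(self, path):
...         Path(path).mkdir(parents=True, exist_ok=True)  # nothing else to persist in this sketch
...     @classmethod
...     def from_pretrained(cls, path):
...         return cls()
>>> WhitespaceTokenizer().tokenize("a b c")
[['a', 'b', 'c']]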
sumire.tokenizer.jumanpp module
- class sumire.tokenizer.jumanpp.JumanppTokenizer(*args, **kwargs)
Bases:
BaseTokenizer
Tokenizer class using Juman++ for Japanese text tokenization.
Example
>>> tokenizer = JumanppTokenizer()
>>> text = "これはテスト文です。"
>>> tokens = tokenizer.tokenize(text)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。']]
- classmethod from_pretrained(path: str | Path)
Loads a tokenizer from a saved configuration.
- Parameters:
path (str or Path) – Directory path to the saved configuration.
- Returns:
Tokenizer instance initialized with the saved configuration.
- Return type:
JumanppTokenizer
- save_pretrained(path: str | Path)
Saves tokenizer configuration to the specified path.
- Parameters:
path (str or Path) – Directory path to save the configuration.
- tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes input text.
- Parameters:
inputs (str or List[str]) – Text or list of texts to tokenize.
- Returns:
List of tokenized texts.
- Return type:
List[List[str]]
Example
>>> tokenizer = JumanppTokenizer()
>>> text = "これはテスト文です。"
>>> tokens = tokenizer.tokenize(text)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。']]
sumire.tokenizer.mecab module
- class sumire.tokenizer.mecab.MecabTokenizer(dictionary: Literal['unidic', 'unidic-lite', 'ipadic', 'mecab-ipadic-neologd', 'mecab-unidic-neologd'] = 'unidic', *args, **kwargs)
Bases:
BaseTokenizer
Tokenizer class using MeCab for Japanese text tokenization.
- Parameters:
dictionary (AvailableDictionaries, optional) – Dictionary to use for tokenization. Default is “unidic”.
Example
>>> tokenizer = MecabTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。'], ['別', 'の', 'テキスト', 'も', 'トークン', '化', 'し', 'ます', '。']]
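A hedged sketch of selecting a different dictionary; it assumes the corresponding dictionary package (here ipadic) is installed, and the resulting segmentation may differ from the unidic output shown above.
>>> tokenizer = MecabTokenizer(dictionary="ipadic")  # requires the ipadic dictionary to be installed
>>> tokens = tokenizer.tokenize("これはテスト文です。")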
- classmethod from_pretrained(path: str | Path)
Loads a tokenizer from a saved configuration.
- Parameters:
path (str or Path) – Directory path to the saved configuration.
- Returns:
A tokenizer loaded from the specified configuration.
- Return type:
MecabTokenizer
- save_pretrained(path: str | Path)
Saves tokenizer configuration to the specified path.
- Parameters:
path (str or Path) – Directory path to save the configuration.
- setup_tagger(dictionary: Literal['unidic', 'unidic-lite', 'ipadic', 'mecab-ipadic-neologd', 'mecab-unidic-neologd']) Tagger
Sets up the MeCab tagger based on the selected dictionary.
- Parameters:
dictionary (AvailableDictionaries) – Dictionary to use for tokenization.
- Returns:
MeCab tagger configured with the selected dictionary.
- Return type:
Tagger
- tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes input text using MeCab.
- Parameters:
inputs (str or List[str]) – Text or list of texts to tokenize.
- Returns:
Tokenized texts as lists of strings.
- Return type:
TokenizerOutputs
Example
>>> tokenizer = MecabTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。'], ['別', 'の', 'テキスト', 'も', 'トークン', '化', 'し', 'ます', '。']]
sumire.tokenizer.spacy_ginza module
- class sumire.tokenizer.spacy_ginza.SpacyGinzaTokenizer(*args, **kwargs)
Bases:
BaseTokenizer
Tokenizer class using SpaCy with the Ginza model for Japanese text tokenization.
Example
>>> tokenizer = SpacyGinzaTokenizer()
>>> text = "これはテスト文です。"
>>> tokens = tokenizer.tokenize(text)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。']]
- classmethod from_pretrained(path: str | Path)
Loads a tokenizer from a saved configuration.
- Parameters:
path (str or Path) – Directory path to the saved configuration.
- Returns:
Tokenizer instance initialized with the saved configuration.
- Return type:
SpacyGinzaTokenizer
- save_pretrained(path: str | Path)
Saves tokenizer configuration to the specified path.
- Parameters:
path (str or Path) – Directory path to save the configuration.
- tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes input text using SpaCy with the Ginza model.
- Parameters:
inputs (str or List[str]) – Text or list of texts to tokenize.
- Returns:
List of tokenized texts.
- Return type:
List[List[str]]
Example
>>> tokenizer = SpacyGinzaTokenizer()
>>> text = "これはテスト文です。"
>>> tokens = tokenizer.tokenize(text)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。']]
sumire.tokenizer.spm module
- class sumire.tokenizer.spm.SentencePieceTokenizer(vocab_size: int = 32000, model_type: Literal['unigram', 'bpe', 'char', 'word'] = 'unigram', character_coverage: float = 0.995, spm_model: SentencePieceProcessor | None = None, *args, **kwargs)
Bases:
BaseTokenizer
Tokenizer class using SentencePiece for text tokenization.
- Parameters:
vocab_size (int, optional) – Vocabulary size. Default is 32000.
model_type (ModelType, optional) – Type of SentencePiece model. Default is “unigram”.
character_coverage (float, optional) – Character coverage for SentencePiece model. Default is 0.995.
spm_model (sentencepiece.SentencePieceProcessor, optional) – Pre-trained SentencePiece model. Default is None.
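A hedged sketch of the train-then-use workflow (fit, then tokenize, encode, and decode, documented below). The two-sentence corpus and the tiny vocab_size are purely illustrative; in practice fit needs a corpus large enough to support the chosen vocabulary size.
>>> from sumire.tokenizer.spm import SentencePieceTokenizer
>>> corpus = ["これはテスト文です。", "別のテキストもトークン化します。"]  # illustrative corpus
>>> tokenizer = SentencePieceTokenizer(vocab_size=32, model_type="unigram")  # tiny vocab for the tiny corpus
>>> tokenizer.fit(corpus)
>>> pieces = tokenizer.tokenize(corpus)  # List[List[str]] of subword pieces
>>> ids = tokenizer.encode(corpus)       # List[List[int]] of piece ids
>>> texts = tokenizer.decode(ids)        # List[str] of reconstructed texts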
- decode(inputs: List[int] | List[List[int]]) List[str]
Decodes input tokens using SentencePiece.
- Parameters:
inputs (List[int] or List[List[int]]) – Encoded tokens to decode.
- Returns:
Decoded texts as strings.
- Return type:
List[str]
- encode(inputs: str | List[str]) List[List[int]]
Encodes input text using SentencePiece.
- Parameters:
inputs (str or List[str]) – Text or list of texts to encode.
- Returns:
Encoded tokens as lists of integers.
- Return type:
List[List[int]]
- fit(inputs: List[str]) None
Fits the tokenizer on a list of texts.
- Parameters:
inputs (List[str]) – List of texts to fit the tokenizer on.
- classmethod from_pretrained(path: str | Path)
Loads tokenizer from a saved configuration.
- Parameters:
path (str or Path) – Directory path to the saved configuration.
- Returns:
Tokenizer instance initialized with the saved configuration.
- Return type:
SentencePieceTokenizer
- model_cache_dir
Cache directory for trained SentencePiece model files, resolved at import time to a unique subdirectory under the user's cache directory (e.g. ~/.cache/sumire/sentencepiece/<uuid>).
- Type:
pathlib.Path
- save_pretrained(path: str | Path)
Saves tokenizer configuration to the specified path.
- Parameters:
path (str or Path) – Directory path to save the configuration.
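Because SentencePiece is trained on data, persisting a fitted tokenizer is the typical way to reuse it. A hedged sketch continuing from the training sketch above; the path is illustrative, and the saved configuration is assumed to include the trained model file.
>>> tokenizer.save_pretrained("/tmp/spm_tokenizer")  # hypothetical path; tokenizer fitted as above
>>> restored = SentencePieceTokenizer.from_pretrained("/tmp/spm_tokenizer")
>>> tokens = restored.tokenize("これはテスト文です。")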
- tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes input text using SentencePiece.
- Parameters:
inputs (str or List[str]) – Text or list of texts to tokenize.
- Returns:
Tokenized texts as lists of strings.
- Return type:
TokenizerOutputs
- tokenizer_file_prefix = 'sentencepiece'
sumire.tokenizer.sudachi module
- class sumire.tokenizer.sudachi.SudachiTokenizer(dict_type: str = 'full', normalize: bool = False, *args, **kwargs)
Bases:
BaseTokenizer
Tokenizer class using SudachiPy for Japanese text tokenization.
- Parameters:
dict_type (str, optional) – Type of SudachiPy dictionary. Default is “full”.
normalize (bool, optional) – Flag to normalize tokens. Default is False.
- dict_type
Type of SudachiPy dictionary.
- Type:
str
- normalize
Flag to normalize tokens.
- Type:
bool
- tokenizer
SudachiPy tokenizer object.
- Type:
sudachipy.Dictionary
Example
>>> tokenizer = SudachiTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens[0]
['これ', 'は', 'テスト', '文', 'です', '。']
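A hedged sketch of the normalize flag, which is documented above as controlling token normalization; the exact normalized forms depend on the SudachiPy dictionary, so no output is shown.
>>> tokenizer = SudachiTokenizer(normalize=True)
>>> tokens = tokenizer.tokenize("トークン化します。")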
- classmethod from_pretrained(path: str | Path)
Loads tokenizer from a saved configuration.
- Parameters:
path (str or Path) – Directory path to the saved configuration.
- Returns:
Tokenizer instance initialized with the saved configuration.
- Return type:
SudachiTokenizer
- save_pretrained(path: str | Path)
Saves tokenizer configuration to the specified path.
- Parameters:
path (str or Path) – Directory path to save the configuration.
- tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes input text.
- Parameters:
inputs (str or List[str]) – Text or list of texts to tokenize.
- Returns:
List of tokenized texts.
- Return type:
List[List[str]]
Example
>>> tokenizer = SudachiTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens[0]
['これ', 'は', 'テスト', '文', 'です', '。']