sumire.tokenizer package

Submodules

sumire.tokenizer.auto module

class sumire.tokenizer.auto.AutoJapaneseTokenizer(path: str | None = None, *args, **kwargs)

Bases: BaseTokenizer

AutoJapaneseTokenizer automatically selects the tokenizer based on the given path or uses MecabTokenizer if no path is provided.

Parameters:

path (str, optional) – The directory path to a saved tokenizer configuration. If not provided, a simple MecabTokenizer is used.

Example

>>> tokenizer = AutoJapaneseTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens[0]
['これ', 'は', 'テスト', '文', 'です', '。']
classmethod from_pretrained(path: str | Path) → TokenizerType

Loads a tokenizer from a saved configuration.

Parameters:

path (str or Path) – Directory path to the saved configuration.

Returns:

Tokenizer instance initialized with the saved configuration.

Return type:

TokenizerType

save_pretrained(path: str | Path)

Saves tokenizer configuration to the specified path.

Parameters:

path (str or Path) – Directory path to save the configuration.
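
Example

A save/load round trip sketch; the directory path is illustrative, and the default tokenizer (MecabTokenizer) is assumed.

>>> tokenizer = AutoJapaneseTokenizer()
>>> tokenizer.save_pretrained("path/to/tokenizer")
>>> restored = AutoJapaneseTokenizer.from_pretrained("path/to/tokenizer")
>>> restored.tokenize("これはテスト文です。")[0]
['これ', 'は', 'テスト', '文', 'です', '。']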

tokenize(inputs: str | List[str]) → List[List[str]]

Tokenizes input text.

Parameters:

inputs (str or List[str]) – Text or list of texts to tokenize.

Returns:

List of tokenized texts.

Return type:

List[List[str]]

sumire.tokenizer.auto.tokenizer_dict(name: str) → Type[MecabTokenizer | JumanppTokenizer | SpacyGinzaTokenizer | SentencePieceTokenizer | SudachiTokenizer]

sumire.tokenizer.base module

class sumire.tokenizer.base.BaseTokenizer(*args, **kwargs)

Bases: ABC

fit(inputs: str | List[str]) → None

Trains the tokenizer if necessary.

classmethod from_pretrained(path: str | Path)

Loads a pretrained tokenizer from the specified path.

Parameters:

path (str or Path) – The directory path where the pretrained tokenizer is saved.

Returns:

An instance of the pretrained tokenizer.

Return type:

BaseTokenizer

Raises:

NotImplementedError – This method must be implemented in derived classes.

abstract save_pretrained(path: str | Path)

Saves the pretrained tokenizer to the specified path.

Parameters:

path (str or Path) – The directory path where the pretrained tokenizer will be saved.

Raises:

NotImplementedError – This method must be implemented in derived classes.

abstract tokenize(inputs: str | List[str]) → List[List[str]]

Tokenizes the input text or list of texts.

Parameters:

inputs (TokenizerInputs) – The input text or a list of texts to tokenize.

Returns:

A list of lists, where each inner list represents tokenized words for a single input text.

Return type:

TokenizerOutputs

Raises:

NotImplementedError – This method must be implemented in derived classes.

tokenizer_config_file = 'tokenizer_config.json'
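
A minimal sketch of a concrete subclass, shown only to illustrate the abstract interface; WhitespaceTokenizer is hypothetical and not part of the package, and BaseTokenizer.__init__ is assumed to require no extra arguments.

>>> from sumire.tokenizer.base import BaseTokenizer
>>> class WhitespaceTokenizer(BaseTokenizer):
...     def tokenize(self, inputs):
...         texts = [inputs] if isinstance(inputs, str) else inputs
...         return [text.split() for text in texts]
...     def save_pretrained(self, path):
...         pass  # nothing to persist for this toy tokenizer
...
>>> WhitespaceTokenizer().tokenize("hello world")
[['hello', 'world']]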

sumire.tokenizer.jumanpp module

class sumire.tokenizer.jumanpp.JumanppTokenizer(*args, **kwargs)

Bases: BaseTokenizer

Tokenizer class using Juman++ for Japanese text tokenization.

Example

>>> tokenizer = JumanppTokenizer()
>>> text = "これはテスト文です。"
>>> tokens = tokenizer.tokenize(text)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。']]
classmethod from_pretrained(path: str | Path)

Loads a tokenizer from a saved configuration.

Parameters:

path (str or Path) – Directory path to the saved configuration.

Returns:

Tokenizer instance initialized with the saved configuration.

Return type:

JumanppTokenizer

save_pretrained(path: str | Path)

Saves tokenizer configuration to the specified path.

Parameters:

path (str or Path) – Directory path to save the configuration.

tokenize(inputs: str | List[str]) → List[List[str]]

Tokenizes input text.

Parameters:

inputs (str or List[str]) – Text or list of texts to tokenize.

Returns:

List of tokenized texts.

Return type:

List[List[str]]

Example

>>> tokenizer = JumanppTokenizer()
>>> text = "これはテスト文です。"
>>> tokens = tokenizer.tokenize(text)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。']]

sumire.tokenizer.mecab module

class sumire.tokenizer.mecab.MecabTokenizer(dictionary: Literal['unidic', 'unidic-lite', 'ipadic', 'mecab-ipadic-neologd', 'mecab-unidic-neologd'] = 'unidic', *args, **kwargs)

Bases: BaseTokenizer

Tokenizer class using MeCab for Japanese text tokenization.

Parameters:

dictionary (AvailableDictionaries, optional) – Dictionary to use for tokenization. Default is “unidic”.

Example

>>> tokenizer = MecabTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。'], ['別', 'の', 'テキスト', 'も', 'トークン', '化', 'し', 'ます', '。']]
classmethod from_pretrained(path: str | Path)

Loads a tokenizer from a saved configuration.

Parameters:

path (str or Path) – Directory path to the saved configuration.

Returns:

A tokenizer loaded from the specified configuration.

Return type:

MecabTokenizer

save_pretrained(path: str | Path)

Saves tokenizer configuration to the specified path.

Parameters:

path (str or Path) – Directory path to save the configuration.

setup_tagger(dictionary: Literal['unidic', 'unidic-lite', 'ipadic', 'mecab-ipadic-neologd', 'mecab-unidic-neologd']) → Tagger

Sets up the MeCab tagger based on the selected dictionary.

Parameters:

dictionary (AvailableDictionaries) – Dictionary to use for tokenization.

Returns:

MeCab tagger configured with the selected dictionary.

Return type:

Tagger

tokenize(inputs: str | List[str]) → List[List[str]]

Tokenizes input text using MeCab.

Parameters:

inputs (str or List[str]) – Text or list of texts to tokenize.

Returns:

Tokenized texts as lists of strings.

Return type:

TokenizerOutputs

Example

>>> tokenizer = MecabTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。'], ['別', 'の', 'テキスト', 'も', 'トークン', '化', 'し', 'ます', '。']]
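
The dictionary argument selects which MeCab dictionary backs the tagger. A hedged sketch; the named dictionary package must be installed separately, and token boundaries may differ slightly between dictionaries.

>>> tokenizer = MecabTokenizer(dictionary="ipadic")
>>> tokens = tokenizer.tokenize("これはテスト文です。")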

sumire.tokenizer.spacy_ginza module

class sumire.tokenizer.spacy_ginza.SpacyGinzaTokenizer(*args, **kwargs)

Bases: BaseTokenizer

Tokenizer class using SpaCy with the Ginza model for Japanese text tokenization.

Example

>>> tokenizer = SpacyGinzaTokenizer()
>>> text = "これはテスト文です。"
>>> tokens = tokenizer.tokenize(text)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。']]
classmethod from_pretrained(path: str | Path)

Loads a tokenizer from a saved configuration.

Parameters:

path (str or Path) – Directory path to the saved configuration.

Returns:

Tokenizer instance initialized with the saved configuration.

Return type:

SpacyGinzaTokenizer

save_pretrained(path: str | Path)

Saves tokenizer configuration to the specified path.

Parameters:

path (str or Path) – Directory path to save the configuration.

tokenize(inputs: str | List[str]) → List[List[str]]

Tokenizes input text using SpaCy with the Ginza model.

Parameters:

inputs (str or List[str]) – Text or list of texts to tokenize.

Returns:

List of tokenized texts.

Return type:

List[List[str]]

Example

>>> tokenizer = SpacyGinzaTokenizer()
>>> text = "これはテスト文です。"
>>> tokens = tokenizer.tokenize(text)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。']]

sumire.tokenizer.spm module

class sumire.tokenizer.spm.SentencePieceTokenizer(vocab_size: int = 32000, model_type: Literal['unigram', 'bpe', 'char', 'word'] = 'unigram', character_coverage: float = 0.995, spm_model: SentencePieceProcessor | None = None, *args, **kwargs)

Bases: BaseTokenizer

Tokenizer class using SentencePiece for text tokenization.

Parameters:
  • vocab_size (int, optional) – Vocabulary size. Default is 32000.

  • model_type (ModelType, optional) – Type of SentencePiece model. Default is “unigram”.

  • character_coverage (float, optional) – Character coverage for SentencePiece model. Default is 0.995.

  • spm_model (sentencepiece.SentencePieceProcessor, optional) – Pre-trained SentencePiece model. Default is None.

decode(inputs: List[int] | List[List[int]]) → List[str]

Decodes input tokens using SentencePiece.

Parameters:

inputs (List[int] or List[List[int]]) – Encoded tokens to decode.

Returns:

Decoded texts as strings.

Return type:

List[str]

encode(inputs: str | List[str]) → List[List[int]]

Encodes input text using SentencePiece.

Parameters:

inputs (str or List[str]) – Text or list of texts to encode.

Returns:

Encoded tokens as lists of integers.

Return type:

List[List[int]]

fit(inputs: List[str]) → None

Fits the tokenizer on a list of texts.

Parameters:

inputs (List[str]) – List of texts to fit the tokenizer on.
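
Because SentencePieceTokenizer is trained rather than rule-based, fit must be called (or a pre-trained spm_model supplied) before encoding. A workflow sketch; training_texts is a placeholder and must be large enough to support the chosen vocab_size, and the decoded output assumes a successfully trained model.

>>> tokenizer = SentencePieceTokenizer(vocab_size=8000)
>>> tokenizer.fit(training_texts)  # training_texts: a list of raw sentences (placeholder)
>>> ids = tokenizer.encode(["これはテスト文です。"])
>>> tokenizer.decode(ids)
['これはテスト文です。']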

classmethod from_pretrained(path: str | Path)

Loads a tokenizer from a saved configuration.

Parameters:

path (str or Path) – Directory path to the saved configuration.

Returns:

Tokenizer instance initialized with the saved configuration.

Return type:

SentencePieceTokenizer

model_cache_dir

Cache directory for SentencePiece model files; the value is generated at runtime under ~/.cache/sumire/sentencepiece/.

save_pretrained(path: str | Path)

Saves tokenizer configuration to the specified path.

Parameters:

path (str or Path) – Directory path to save the configuration.

tokenize(inputs: str | List[str]) → List[List[str]]

Tokenizes input text using SentencePiece.

Parameters:

inputs (str or List[str]) – Text or list of texts to tokenize.

Returns:

Tokenized texts as lists of strings.

Return type:

TokenizerOutputs

tokenizer_file_prefix = 'sentencepiece'

sumire.tokenizer.sudachi module

class sumire.tokenizer.sudachi.SudachiTokenizer(dict_type: str = 'full', normalize: bool = False, *args, **kwargs)

Bases: BaseTokenizer

Class for a custom tokenizer using SudachiPy.

Parameters:
  • dict_type (str, optional) – Type of SudachiPy dictionary. Default is “full”.

  • normalize (bool, optional) – Flag to normalize tokens. Default is False.

dict_type

Type of SudachiPy dictionary.

Type:

str

normalize

Flag to normalize tokens.

Type:

bool

tokenizer

SudachiPy tokenizer object.

Type:

sudachipy.Dictionary
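
With normalize=True, tokens are returned in their normalized form. A sketch only; the exact normalized surface forms depend on the SudachiPy dictionary in use.

>>> tokenizer = SudachiTokenizer(normalize=True)
>>> tokens = tokenizer.tokenize("これはテスト文です。")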

Example

>>> tokenizer = SudachiTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens[0]
['これ', 'は', 'テスト', '文', 'です', '。']
classmethod from_pretrained(path: str | Path)

Loads a tokenizer from a saved configuration.

Parameters:

path (str or Path) – Directory path to the saved configuration.

Returns:

Tokenizer instance initialized with the saved configuration.

Return type:

SudachiTokenizer

save_pretrained(path: str | Path)

Saves tokenizer configuration to the specified path.

Parameters:

path (str or Path) – Directory path to save the configuration.

tokenize(inputs: str | List[str]) → List[List[str]]

Tokenizes input text.

Parameters:

inputs (str or List[str]) – Text or list of texts to tokenize.

Returns:

List of tokenized texts.

Return type:

List[List[str]]

Example

>>> tokenizer = SudachiTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens[0]
['これ', 'は', 'テスト', '文', 'です', '。']

Module contents