sumire.tokenizer package
Submodules
sumire.tokenizer.auto module
- class sumire.tokenizer.auto.AutoJapaneseTokenizer(path: str | None = None, *args, **kwargs)
Bases:
BaseTokenizer
AutoJapaneseTokenizer automatically selects the tokenizer based on the given path or uses MecabTokenizer if no path is provided.
- Parameters:
path (str, optional) – The directory path to a saved tokenizer configuration. If not provided, a simple MecabTokenizer is used.
Example
>>> tokenizer = AutoJapaneseTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens[0]
['これ', 'は', 'テスト', '文', 'です', '。']
- classmethod from_pretrained(path: str | Path) TokenizerType
Loads a tokenizer from a saved configuration.
- Parameters:
path (str or Path) – Directory path to the saved configuration.
- Returns:
Tokenizer instance initialized with the saved configuration.
- Return type:
TokenizerType
- save_pretrained(path: str | Path)
Saves tokenizer configuration to the specified path.
- Parameters:
path (str or Path) – Directory path to save the configuration.
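A hedged sketch of a save-and-reload round trip with save_pretrained and from_pretrained; the directory path below is illustrative, not one the library prescribes.
>>> from sumire.tokenizer.auto import AutoJapaneseTokenizer
>>> tokenizer = AutoJapaneseTokenizer()
>>> tokenizer.save_pretrained("/tmp/sumire_tokenizer")  # hypothetical path
>>> restored = AutoJapaneseTokenizer.from_pretrained("/tmp/sumire_tokenizer")
>>> tokens = restored.tokenize("これはテスト文です。")  # restored tokenizer behaves like the saved one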
- tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes input text.
- Parameters:
inputs (str or List[str]) – Text or list of texts to tokenize.
- Returns:
List of tokenized texts.
- Return type:
List[List[str]]
- sumire.tokenizer.auto.tokenizer_dict(name: str) Type[MecabTokenizer | JumanppTokenizer | SpacyGinzaTokenizer | SentencePieceTokenizer | SudachiTokenizer]
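No description accompanies tokenizer_dict; judging from its signature it maps a tokenizer name to the corresponding tokenizer class. A hedged sketch follows; the key string "mecab" is an assumed registry name, not one confirmed by this documentation.
>>> from sumire.tokenizer.auto import tokenizer_dict
>>> cls = tokenizer_dict("mecab")  # "mecab" is an assumed key; actual names depend on the library
>>> tokenizer = cls()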
sumire.tokenizer.base module
- class sumire.tokenizer.base.BaseTokenizer(*args, **kwargs)
Bases:
ABC
- fit(inputs: str | List[str]) None
Trains the tokenizer if training is required.
- classmethod from_pretrained(path: str | Path)
Loads a pretrained tokenizer from the specified path.
- Parameters:
path (str or Path) – The directory path where the pretrained tokenizer is saved.
- Returns:
An instance of the pretrained tokenizer.
- Raises:
NotImplementedError – This method must be implemented in derived classes.
- abstract save_pretrained(path: str | Path)
Saves the pretrained tokenizer to the specified path.
- Parameters:
path (str or Path) – The directory path where the pretrained tokenizer will be saved.
- Raises:
NotImplementedError – This method must be implemented in derived classes.
- abstract tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes the input text or list of texts.
- Parameters:
inputs (TokenizerInputs) – The input text or a list of texts to tokenize.
- Returns:
A list of lists, where each inner list represents tokenized words for a single input text.
- Return type:
TokenizerOutputs
- Raises:
NotImplementedError – This method must be implemented in derived classes.
- tokenizer_config_file = 'tokenizer_config.json'
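A minimal sketch of a custom subclass, assuming only what the abstract interface above requires; the whitespace-splitting logic and the no-op persistence are illustrative, not part of the library.
>>> from pathlib import Path
>>> from sumire.tokenizer.base import BaseTokenizer
>>> class WhitespaceTokenizer(BaseTokenizer):  # hypothetical subclass for illustration
...     def tokenize(self, inputs):
...         texts = [inputs] if isinstance(inputs, str) else inputs
...         return [text.split() for text in texts]
...     def save_pretrained(self, path):
...         Path(path).mkdir(parents=True, exist_ok=True)  # nothing else to persist in this sketch
...     @classmethod
...     def from_pretrained(cls, path):
...         return cls()
>>> WhitespaceTokenizer().tokenize("a b c")
[['a', 'b', 'c']]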
sumire.tokenizer.jumanpp module
- class sumire.tokenizer.jumanpp.JumanppTokenizer(*args, **kwargs)
Bases:
BaseTokenizer
Tokenizer class using Juman++ for Japanese text tokenization.
Example
>>> tokenizer = JumanppTokenizer()
>>> text = "これはテスト文です。"
>>> tokens = tokenizer.tokenize(text)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。']]
- classmethod from_pretrained(path: str | Path)
Loads a tokenizer from a saved configuration.
- Parameters:
path (str or Path) – Directory path to the saved configuration.
- Returns:
Tokenizer instance initialized with the saved configuration.
- Return type:
JumanppTokenizer
- save_pretrained(path: str | Path)
Saves tokenizer configuration to the specified path.
- Parameters:
path (str or Path) – Directory path to save the configuration.
- tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes input text.
- Parameters:
inputs (str or List[str]) – Text or list of texts to tokenize.
- Returns:
List of tokenized texts.
- Return type:
List[List[str]]
Example
>>> tokenizer = JumanppTokenizer()
>>> text = "これはテスト文です。"
>>> tokens = tokenizer.tokenize(text)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。']]
sumire.tokenizer.mecab module
- class sumire.tokenizer.mecab.MecabTokenizer(dictionary: Literal['unidic', 'unidic-lite', 'ipadic', 'mecab-ipadic-neologd', 'mecab-unidic-neologd'] = 'unidic', *args, **kwargs)
Bases:
BaseTokenizer
Tokenizer class using MeCab for Japanese text tokenization.
- Parameters:
dictionary (AvailableDictionaries, optional) – Dictionary to use for tokenization. Default is “unidic”.
Example
>>> tokenizer = MecabTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。'], ['別', 'の', 'テキスト', 'も', 'トークン', '化', 'し', 'ます', '。']]
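A hedged sketch of selecting a different dictionary; it assumes the corresponding dictionary package (here ipadic) is installed, and the resulting segmentation may differ from the unidic output shown above.
>>> tokenizer = MecabTokenizer(dictionary="ipadic")  # requires the ipadic dictionary to be installed
>>> tokens = tokenizer.tokenize("これはテスト文です。")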
- classmethod from_pretrained(path: str | Path)
Loads a tokenizer from a saved configuration.
- Parameters:
path (str or Path) – Directory path to the saved configuration.
- Returns:
A tokenizer loaded from the specified configuration.
- Return type:
MecabTokenizer
- save_pretrained(path: str | Path)
Saves tokenizer configuration to the specified path.
- Parameters:
path (str or Path) – Directory path to save the configuration.
- setup_tagger(dictionary: Literal['unidic', 'unidic-lite', 'ipadic', 'mecab-ipadic-neologd', 'mecab-unidic-neologd']) Tagger
Sets up the MeCab tagger based on the selected dictionary.
- Parameters:
dictionary (AvailableDictionaries) – Dictionary to use for tokenization.
- Returns:
MeCab tagger configured with the selected dictionary.
- Return type:
Tagger
- tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes input text using MeCab.
- Parameters:
inputs (str or List[str]) – Text or list of texts to tokenize.
- Returns:
Tokenized texts as lists of strings.
- Return type:
TokenizerOutputs
Example
>>> tokenizer = MecabTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。'], ['別', 'の', 'テキスト', 'も', 'トークン', '化', 'し', 'ます', '。']]
sumire.tokenizer.spacy_ginza module
- class sumire.tokenizer.spacy_ginza.SpacyGinzaTokenizer(*args, **kwargs)
Bases:
BaseTokenizer
Tokenizer class using SpaCy with the Ginza model for Japanese text tokenization.
Example
>>> tokenizer = SpacyGinzaTokenizer()
>>> text = "これはテスト文です。"
>>> tokens = tokenizer.tokenize(text)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。']]
- classmethod from_pretrained(path: str | Path)
Loads a tokenizer from a saved configuration.
- Parameters:
path (str or Path) – Directory path to the saved configuration.
- Returns:
Tokenizer instance initialized with the saved configuration.
- Return type:
SpacyGinzaTokenizer
- save_pretrained(path: str | Path)
Saves tokenizer configuration to the specified path.
- Parameters:
path (str or Path) – Directory path to save the configuration.
- tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes input text using SpaCy with the Ginza model.
- Parameters:
inputs (str or List[str]) – Text or list of texts to tokenize.
- Returns:
List of tokenized texts.
- Return type:
List[List[str]]
Example
>>> tokenizer = SpacyGinzaTokenizer()
>>> text = "これはテスト文です。"
>>> tokens = tokenizer.tokenize(text)
>>> tokens
[['これ', 'は', 'テスト', '文', 'です', '。']]
sumire.tokenizer.spm module
- class sumire.tokenizer.spm.SentencePieceTokenizer(vocab_size: int = 32000, model_type: Literal['unigram', 'bpe', 'char', 'word'] = 'unigram', character_coverage: float = 0.995, spm_model: SentencePieceProcessor | None = None, *args, **kwargs)
Bases:
BaseTokenizer
Tokenizer class using SentencePiece for text tokenization.
- Parameters:
vocab_size (int, optional) – Vocabulary size. Default is 32000.
model_type (ModelType, optional) – Type of SentencePiece model. Default is “unigram”.
character_coverage (float, optional) – Character coverage for SentencePiece model. Default is 0.995.
spm_model (sentencepiece.SentencePieceProcessor, optional) – Pre-trained SentencePiece model. Default is None.
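A hedged sketch of the train-then-use workflow (fit, then tokenize, encode, and decode, documented below). The two-sentence corpus and the tiny vocab_size are purely illustrative; in practice fit needs a corpus large enough to support the chosen vocabulary size.
>>> from sumire.tokenizer.spm import SentencePieceTokenizer
>>> corpus = ["これはテスト文です。", "別のテキストもトークン化します。"]  # illustrative corpus
>>> tokenizer = SentencePieceTokenizer(vocab_size=32, model_type="unigram")  # tiny vocab for the tiny corpus
>>> tokenizer.fit(corpus)
>>> pieces = tokenizer.tokenize(corpus)  # List[List[str]] of subword pieces
>>> ids = tokenizer.encode(corpus)       # List[List[int]] of piece ids
>>> texts = tokenizer.decode(ids)        # List[str] of reconstructed texts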
- decode(inputs: List[int] | List[List[int]]) List[str]
Decodes input tokens using SentencePiece.
- Parameters:
inputs (List[int] or List[List[int]]) – Encoded tokens to decode.
- Returns:
Decoded texts as strings.
- Return type:
List[str]
- encode(inputs: str | List[str]) List[List[int]]
Encodes input text using SentencePiece.
- Parameters:
inputs (str or List[str]) – Text or list of texts to encode.
- Returns:
Encoded tokens as lists of integers.
- Return type:
List[List[int]]
- fit(inputs: List[str]) None
Fits the tokenizer on a list of texts.
- Parameters:
inputs (List[str]) – List of texts to fit the tokenizer on.
- classmethod from_pretrained(path: str | Path)
Loads tokenizer from a saved configuration.
- Parameters:
path (str or Path) – Directory path to the saved configuration.
- Returns:
Tokenizer instance initialized with the saved configuration.
- Return type:
SentencePieceTokenizer
- model_cache_dir
Cache directory for trained SentencePiece model files, resolved at import time to a unique subdirectory under the user's cache directory (e.g. ~/.cache/sumire/sentencepiece/<uuid>).
- Type:
pathlib.Path
- save_pretrained(path: str | Path)
Saves tokenizer configuration to the specified path.
- Parameters:
path (str or Path) – Directory path to save the configuration.
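Because SentencePiece is trained on data, persisting a fitted tokenizer is the typical way to reuse it. A hedged sketch continuing from the training sketch above; the path is illustrative, and the saved configuration is assumed to include the trained model file.
>>> tokenizer.save_pretrained("/tmp/spm_tokenizer")  # hypothetical path; tokenizer fitted as above
>>> restored = SentencePieceTokenizer.from_pretrained("/tmp/spm_tokenizer")
>>> tokens = restored.tokenize("これはテスト文です。")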
- tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes input text using SentencePiece.
- Parameters:
inputs (str or List[str]) – Text or list of texts to tokenize.
- Returns:
Tokenized texts as lists of strings.
- Return type:
TokenizerOutputs
- tokenizer_file_prefix = 'sentencepiece'
sumire.tokenizer.sudachi module
- class sumire.tokenizer.sudachi.SudachiTokenizer(dict_type: str = 'full', normalize: bool = False, *args, **kwargs)
Bases:
BaseTokenizer
Tokenizer class using SudachiPy for Japanese text tokenization.
- Parameters:
dict_type (str, optional) – Type of SudachiPy dictionary. Default is “full”.
normalize (bool, optional) – Flag to normalize tokens. Default is False.
- dict_type
Type of SudachiPy dictionary.
- Type:
str
- normalize
Flag to normalize tokens.
- Type:
bool
- tokenizer
SudachiPy tokenizer object.
- Type:
sudachipy.Dictionary
Example
>>> tokenizer = SudachiTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens[0]
['これ', 'は', 'テスト', '文', 'です', '。']
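A hedged sketch of the normalize flag, which is documented above as controlling token normalization; the exact normalized forms depend on the SudachiPy dictionary, so no output is shown.
>>> tokenizer = SudachiTokenizer(normalize=True)
>>> tokens = tokenizer.tokenize("トークン化します。")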
- classmethod from_pretrained(path: str | Path)
Loads tokenizer from a saved configuration.
- Parameters:
path (str or Path) – Directory path to the saved configuration.
- Returns:
Tokenizer instance initialized with the saved configuration.
- Return type:
SudachiTokenizer
- save_pretrained(path: str | Path)
Saves tokenizer configuration to the specified path.
- Parameters:
path (str or Path) – Directory path to save the configuration.
- tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes input text.
- Parameters:
inputs (str or List[str]) – Text or list of texts to tokenize.
- Returns:
List of tokenized texts.
- Return type:
List[List[str]]
Example
>>> tokenizer = SudachiTokenizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokens = tokenizer.tokenize(texts)
>>> tokens[0]
['これ', 'は', 'テスト', '文', 'です', '。']