sumire.vectorizer package

Submodules

sumire.vectorizer.count module

class sumire.vectorizer.count.CountVectorizer(tokenizer: str | TokenizerType = 'mecab', lowercase: bool = True, stop_words: List[str] | None = None, ngram_range: Tuple = (1, 1), max_df: float = 1.0, min_df: int = 1, max_features: int | None = None, *args, **kwargs)

Bases: SkLearnVectorizerBase

CountVectorizer is a vectorizer class that uses the CountVectorizer implemented in scikit-learn. This module wraps Japanese text tokenization before passing the inputs to scikit-learn's CountVectorizer.

Parameters:
  • tokenizer (Union[str, TokenizerType]) – The tokenizer to use for tokenization. It can be a tokenizer instance, the name of a pretrained tokenizer, or the name of a built-in tokenizer.

  • lowercase (bool, optional) – Whether to convert all characters to lowercase before tokenization. Defaults to True.

  • stop_words (str, List[str], or None, optional) – The stop words to use for filtering tokens. Defaults to None.

  • ngram_range (tuple, optional) – The range of n-grams to extract as features. Defaults to (1, 1) (i.e., only unigrams).

  • max_df (float or int, optional) – The maximum document frequency for a token to be included in the vocabulary. Can be a float in the range [0.0, 1.0] or an integer. Defaults to 1.0 (i.e., no filtering).

  • min_df (float or int, optional) – The minimum document frequency for a token to be included in the vocabulary. Can be a float in the range [0.0, 1.0] or an integer. Defaults to 1 (i.e., no filtering).

  • max_features (int or None, optional) – The maximum number of features (tokens) to include in the vocabulary. Defaults to None (i.e., no limit).

Returns:

None

Example

>>> from sumire.tokenizer import MecabTokenizer
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokenizer = MecabTokenizer()
>>> vectorizer = CountVectorizer(tokenizer=tokenizer)
>>> vectorizer.fit(texts)
>>> transformed = vectorizer.transform(texts)
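
The vocabulary-related arguments can be combined as needed; a minimal sketch reusing the tokenizer and texts above (the values are illustrative, not recommended settings):

>>> vectorizer = CountVectorizer(tokenizer=tokenizer, ngram_range=(1, 2), max_features=1000)
>>> vectorizer.fit(texts)
>>> counts = vectorizer.transform(texts)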
classmethod from_pretrained(path: str | Path)

Loads a pretrained CountVectorizer from the specified path.

Parameters:

path (Union[str, Path]) – The directory path where the pretrained CountVectorizer is saved.

Returns:

A CountVectorizer instance loaded with the pretrained model and configuration.

Return type:

CountVectorizer
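
Example

A sketch; the path below is a placeholder for a directory previously written by save_pretrained.

>>> pretrained_path = "/path/to/pretrained_model"
>>> vectorizer = CountVectorizer.from_pretrained(pretrained_path)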

sumire.vectorizer.swem module

class sumire.vectorizer.swem.ModelCard(*, name: str, url: str, tokenizer_name: str = 'mecab', description: str = '')

Bases: BaseModel

description: str
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'name': FieldInfo(annotation=str, required=True), 'tokenizer_name': FieldInfo(annotation=str, required=False, default='mecab'), 'url': FieldInfo(annotation=str, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

name: str
tokenizer_name: str
url: str
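
Example

A minimal sketch of constructing a ModelCard directly; the name and url values below are placeholders, not a real model card.

>>> from sumire.vectorizer.swem import ModelCard
>>> card = ModelCard(name="example-model", url="https://example.com/model.bin")
>>> card.tokenizer_name
'mecab'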
class sumire.vectorizer.swem.W2VSWEMVectorizer(model_name_or_path: str = '20190520/jawiki.word_vectors.100d', pooling_method: Literal['mean', 'max'] = 'mean', download_timeout: int = 3600, tokenizer: BaseTokenizer | None = None)

Bases: BaseVectorizer

W2VSWEMVectorizer is a vectorizer class that uses Word2Vec-based embeddings for text data.

To use a chive model, set model_name_or_path to “chive-{version}-mc{min count}”, e.g. W2VSWEMVectorizer(“chive-1.0-mc5”). The model alias names are the “name” keys in “sumire/resource/model_card/gensim/chive/*.json”.

To use the cl-tohoku Japanese Wikipedia entity vectors, set model_name_or_path to “{release date}/jawiki.{all|entity|word}_vectors.{dimension}d”, e.g. W2VSWEMVectorizer(“20180402/jawiki.entity_vectors.100d”). The model alias names are the “name” keys of “sumire/resource/model_card/gensim/cl-tohoku_jawiki_vector/**/*.json”.
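
A minimal sketch of both alias styles, assuming the corresponding model cards are bundled and the embeddings can be downloaded:

>>> from sumire.vectorizer.swem import W2VSWEMVectorizer
>>> chive_vectorizer = W2VSWEMVectorizer("chive-1.0-mc5")
>>> jawiki_vectorizer = W2VSWEMVectorizer("20180402/jawiki.entity_vectors.100d")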

Parameters:
  • model_name_or_path (str, optional) –

    The model name or path to Word2Vec embeddings. Default is “20190520/jawiki.word_vectors.100d”.

    The alias names are in the “name” keys of resources/model_card/gensim/**/*.json.

  • pooling_method (str, optional) – The pooling method for aggregating word vectors (“mean” or “max”). Default is “mean”.

  • download_timeout (int, optional) – The timeout for downloading embeddings. Default is 3600.

  • tokenizer (BaseTokenizer, optional) – The tokenizer to use. If not provided, a default MecabTokenizer is used.

w2v_dir

The directory for storing Word2Vec embeddings.

Type:

Path

Examples

>>> from sumire.vectorizer.swem import W2VSWEMVectorizer
>>> vectorizer = W2VSWEMVectorizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> vectors = vectorizer.transform(texts)
>>> vectors.shape
(2, 100)
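
Max pooling can be selected the same way; a minimal sketch assuming the default 100-dimensional model, so the output shape matches the example above:

>>> max_vectorizer = W2VSWEMVectorizer(pooling_method="max")
>>> max_vectorizer.transform(texts).shape
(2, 100)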
fit(texts: str | List[str], *args, **kwargs) None

Fit the vectorizer (not implemented).

Parameters:

texts (str or List[str]) – Input texts for fitting the vectorizer.

Returns:

None

classmethod from_pretrained(path: str | Path)

Load a pretrained vectorizer from a specified path.

Parameters:

path (str or Path) – The directory path to load the pretrained vectorizer from.

Returns:

The loaded pretrained vectorizer.

Return type:

W2VSWEMVectorizer

get_token_vectors(texts: str | List[str]) List[List[Tuple[str, array]]]

Tokenizes each input text and obtains a tuple list of (token, token_vector) for each input text.

Parameters:

texts (TokenizerInputs) – The input text or a list of texts to tokenize.

Returns:

Each internal list consists of tuples of tokenized words and their vector representations.

Return type:

EncodeTokensOutputs

Example

>>> from sumire.vectorizer.swem import W2VSWEMVectorizer
>>> vectorizer = W2VSWEMVectorizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> vectors = vectorizer.get_token_vectors(texts)
>>> len(vectors)
2
>>> isinstance(vectors[0][0][0], str)
True
>>> vectors[0][0][1].shape == (100, )
True
save_pretrained(path: str | Path) None

Save the vectorizer and tokenizer to a specified path.

Parameters:

path (str or Path) – The directory path to save the vectorizer.

Returns:

None
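
Example

A sketch of a save/load round trip; the directory below is a placeholder.

>>> vectorizer.save_pretrained("/path/to/saved_vectorizer")
>>> restored = W2VSWEMVectorizer.from_pretrained("/path/to/saved_vectorizer")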

setup_w2v_if_not_installed(path_or_url: str) Path

Set up Word2Vec embeddings if they are not already installed.

Parameters:

path_or_url (str) – The path or URL to Word2Vec embeddings.

Returns:

The path to the Word2Vec binary file.

Return type:

Path

transform(texts: str | List[str], *args, **kwargs) array

Transform input texts into word vectors.

Parameters:

texts (str or List[str]) – Input texts to be transformed.

Returns:

Transformed text vectors.

Return type:

np.array

w2v_dir = PosixPath('/home/runner/.local/sumire/gensim')

sumire.vectorizer.tfidf module

class sumire.vectorizer.tfidf.TfidfVectorizer(tokenizer: str | TokenizerType = 'mecab', lowercase: bool = True, stop_words: List[str] | None = None, ngram_range: Tuple = (1, 1), max_df: float = 1.0, min_df: int = 1, max_features: int | None = None, norm: str = 'l2', use_idf: bool = True, smooth_idf: bool = True, sublinear_tf: bool = False, *args, **kwargs)

Bases: SkLearnVectorizerBase

TfidfVectorizer is a vectorizer class that uses the TfidfVectorizer implemented in scikit-learn. This module wraps Japanese text tokenization before passing the inputs to scikit-learn's TfidfVectorizer.

Parameters:
  • tokenizer – The tokenizer to use for tokenization.

  • lowercase (bool, optional) – Whether to convert all characters to lowercase before tokenization. Defaults to True.

  • stop_words (str, List[str], or None, optional) – The stop words to use for filtering tokens. Defaults to None.

  • ngram_range (tuple, optional) – The range of n-grams to extract as features. Defaults to (1, 1) (i.e., only unigrams).

  • max_df (float or int, optional) – The maximum document frequency for a token to be included in the vocabulary. Can be a float in the range [0.0, 1.0] or an integer. Defaults to 1.0 (i.e., no filtering).

  • min_df (float or int, optional) – The minimum document frequency for a token to be included in the vocabulary. Can be a float in the range [0.0, 1.0] or an integer. Defaults to 1 (i.e., no filtering).

  • max_features (int or None, optional) – The maximum number of features (tokens) to include in the vocabulary. Defaults to None (i.e., no limit).

  • norm (str, optional) – The normalization method for tf-idf vectors. Defaults to “l2”.

  • use_idf (bool, optional) – Whether to use inverse document frequency in tf-idf computation. Defaults to True.

  • smooth_idf (bool, optional) – Whether to smooth idf weights. Defaults to True.

  • sublinear_tf (bool, optional) – Whether to apply sublinear tf scaling. Defaults to False.

Returns:

None

Example

>>> from sumire.tokenizer import MecabTokenizer
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokenizer = MecabTokenizer()
>>> vectorizer = TfidfVectorizer(tokenizer=tokenizer)
>>> vectorizer.fit(texts)
>>> transformed = vectorizer.transform(texts)
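
The TF-IDF specific arguments can be adjusted in the same way; a minimal sketch reusing the tokenizer and texts above (the values are illustrative):

>>> vectorizer = TfidfVectorizer(tokenizer=tokenizer, ngram_range=(1, 2), sublinear_tf=True)
>>> vectorizer.fit(texts)
>>> transformed = vectorizer.transform(texts)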
classmethod from_pretrained(path: str | Path)

Loads a pretrained TfidfVectorizer from the specified path.

Parameters:

path (Union[str, Path]) – The directory path where the pretrained TfidfVectorizer is saved.

Returns:

A TfidfVectorizer instance loaded with the pretrained model and configuration.

Return type:

TfidfVectorizer

Example

>>> pretrained_path = "/path/to/pretrained_model"
>>> vectorizer = TfidfVectorizer.from_pretrained(pretrained_path)

sumire.vectorizer.transformer_emb module

class sumire.vectorizer.transformer_emb.ModelCard(*, model_name: str, description: str = '')

Bases: BaseModel

description: str
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'model_name': FieldInfo(annotation=str, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

model_name: str
class sumire.vectorizer.transformer_emb.TransformerEmbeddingVectorizer(pretrained_model_name_or_path: str = 'cl-tohoku/bert-base-japanese-v3', pooling_method: Literal['cls', 'mean', 'max'] = 'cls', batch_size: int = 32, max_length: int | None = None, model: PreTrainedModel | None = None, tokenizer: PreTrainedTokenizer | None = None)

Bases: TransformersVectorizerBase

TransformerEmbeddingVectorizer is a vectorizer class that uses transformer-based embeddings (e.g., BERT) for text data.

Tested model information is listed in /sumire/resources/model_card/transformers.

Parameters:
  • pretrained_model_name_or_path (str, optional) – The pretrained model name or path. Default is “cl-tohoku/bert-base-japanese-v3”.

  • pooling_method (str, optional) – The pooling method for aggregating embeddings (“cls”, “mean”, “max”). Default is “cls”.

  • batch_size (int, optional) – The batch size for processing texts. Default is 32.

  • max_length (int, optional) – The maximum length of input sequences. If not provided, it is determined by the model’s configuration.

  • model (PreTrainedModel, optional) – A pretrained transformer model. If not provided, it is loaded from the specified model name or path.

  • tokenizer (PreTrainedTokenizer, optional) – A pretrained tokenizer. If not provided, it is loaded from the specified model name or path.

Examples

>>> from sumire.vectorizer.transformer_emb import TransformerEmbeddingVectorizer
>>> vectorizer = TransformerEmbeddingVectorizer()
>>> texts = ["This is a sample sentence.", "Another example."]
>>> vectors = vectorizer.transform(texts)
>>> vectors.shape
(2, 768)  # Assuming a BERT model with 768-dimensional embeddings
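
Mean or max pooling and a different batch size can be selected at construction time; a minimal sketch assuming the default cl-tohoku/bert-base-japanese-v3 model, so the embedding dimension stays 768:

>>> mean_vectorizer = TransformerEmbeddingVectorizer(pooling_method="mean", batch_size=16)
>>> mean_vectorizer.transform(texts).shape
(2, 768)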
classmethod from_pretrained(path: str | Path)

Load a pretrained TransformerEmbeddingVectorizer from a specified path.

Parameters:

path (str or Path) – The directory path to load the pretrained vectorizer from.

Returns:

The loaded pretrained vectorizer.

Return type:

TransformerEmbeddingVectorizer
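
Example

A sketch; the path below is a placeholder for a directory containing a previously saved vectorizer.

>>> pretrained_path = "/path/to/saved_vectorizer"
>>> vectorizer = TransformerEmbeddingVectorizer.from_pretrained(pretrained_path)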

get_token_vectors(texts: str | List[str]) List[List[Tuple[str, array]]]

Tokenizes each input text and obtains a tuple list of (token, token_vector) for each input text.

Parameters:

texts (TokenizerInputs) – The input text or a list of texts to tokenize.

Returns:

Each internal list consists of tuples of tokenized words and their vector representations.

Return type:

EncodeTokensOutputs

Examples

>>> import numpy as np
>>> from sumire.vectorizer.transformer_emb import TransformerEmbeddingVectorizer
>>> vectorizer = TransformerEmbeddingVectorizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> vectors = vectorizer.get_token_vectors(texts)
>>> len(vectors)
2
>>> isinstance(vectors[0][0][0], str)
True
>>> vectors[0][0][1].shape == (768, )
True
transform(texts: str | List[str], *args, **kwargs) array

Transform input texts into transformer-based embeddings.

Parameters:
  • texts (str or List[str]) – Input texts to be transformed.

  • batch_size (int, optional) – The batch size for processing texts. Default is 32.

Returns:

Transformed embeddings.

Return type:

np.array

Examples

>>> from sumire.vectorizer.transformer_emb import TransformerEmbeddingVectorizer
>>> vectorizer = TransformerEmbeddingVectorizer()
>>> texts = ["This is a sample sentence.", "Another example."]
>>> vectors = vectorizer.transform(texts)
>>> vectors.shape
(2, 768)  # Assuming a BERT model with 768-dimensional embeddings

Module contents