sumire.vectorizer package
Subpackages
Submodules
sumire.vectorizer.count module
- class sumire.vectorizer.count.CountVectorizer(tokenizer: str | TokenizerType = 'mecab', lowercase: bool = True, stop_words: List[str] | None = None, ngram_range: Tuple = (1, 1), max_df: float = 1.0, min_df: int = 1, max_features: int | None = None, *args, **kwargs)
Bases:
SkLearnVectorizerBase
CountVectorizer is a vectorizer class that uses the CountVectorizer module implemented in scikit-learn. This module wraps Japanese text tokenization before the input reaches scikit-learn's CountVectorizer.
- Parameters:
tokenizer (Union[str, TokenizerType]) – The tokenizer to use for tokenization. It can be a tokenizer instance, the name of a pretrained tokenizer, or the name of a built-in tokenizer.
lowercase (bool, optional) – Whether to convert all characters to lowercase before tokenization. Defaults to True.
stop_words (str, List[str], or None, optional) – The stop words to use for filtering tokens. Defaults to None.
ngram_range (tuple, optional) – The range of n-grams to extract as features. Defaults to (1, 1) (i.e., only unigrams).
max_df (float or int, optional) – The maximum document frequency for a token to be included in the vocabulary. Can be a float in the range [0.0, 1.0] or an integer. Defaults to 1.0 (i.e., no filtering).
min_df (float or int, optional) – The minimum document frequency for a token to be included in the vocabulary. Can be a float in the range [0.0, 1.0] or an integer. Defaults to 1 (i.e., no filtering).
max_features (int or None, optional) – The maximum number of features (tokens) to include in the vocabulary. Defaults to None (i.e., no limit).
- Returns:
None
Example
>>> from sumire.tokenizer import MecabTokenizer
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokenizer = MecabTokenizer()
>>> vectorizer = CountVectorizer(tokenizer=tokenizer)
>>> vectorizer.fit(texts)
>>> transformed = vectorizer.transform(texts)
- classmethod from_pretrained(path: str | Path)
Loads a pretrained CountVectorizer from the specified path.
- Parameters:
path (Union[str, Path]) – The directory path where the pretrained CountVectorizer is saved.
- Returns:
A CountVectorizer instance loaded with the pretrained model and configuration.
- Return type:
CountVectorizer
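Example
A minimal loading sketch; the path below is a placeholder for a directory containing a previously saved CountVectorizer.
>>> pretrained_path = "/path/to/pretrained_model"
>>> vectorizer = CountVectorizer.from_pretrained(pretrained_path)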
sumire.vectorizer.swem module
- class sumire.vectorizer.swem.ModelCard(*, name: str, url: str, tokenizer_name: str = 'mecab', description: str = '')
Bases:
BaseModel
- description: str
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'name': FieldInfo(annotation=str, required=True), 'tokenizer_name': FieldInfo(annotation=str, required=False, default='mecab'), 'url': FieldInfo(annotation=str, required=True)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- name: str
- tokenizer_name: str
- url: str
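Example
A minimal construction sketch; the name and url values below are illustrative only.
>>> card = ModelCard(name="example-vectors", url="https://example.com/vectors.bin")
>>> card.tokenizer_name
'mecab'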
- class sumire.vectorizer.swem.W2VSWEMVectorizer(model_name_or_path: str = '20190520/jawiki.word_vectors.100d', pooling_method: Literal['mean', 'max'] = 'mean', download_timeout: int = 3600, tokenizer: BaseTokenizer | None = None)
Bases:
BaseVectorizer
W2VSWEMVectorizer is a vectorizer class that uses Word2Vec-based embeddings for text data.
To use a chive model, set model_name_or_path to "chive-{version}-mc{min count}", e.g. W2VSWEMVectorizer("chive-1.0-mc5"). The model alias names are the "name" keys in "sumire/resource/model_card/gensim/chive/*.json".
To use the cl-tohoku Japanese Wikipedia entity vectors, set model_name_or_path to "{release date}/jawiki.{all|entity|word}_vectors.{dimension}d", e.g. W2VSWEMVectorizer("20180402/jawiki.entity_vectors.100d"). The model alias names are the "name" keys in "sumire/resource/model_card/gensim/cl-tohoku_jawiki_vector/**/*.json".
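For instance, the following sketch uses both alias styles described above; the aliases are the ones quoted in this section, and actual availability depends on the bundled model cards.
>>> chive_vectorizer = W2VSWEMVectorizer("chive-1.0-mc5")
>>> jawiki_vectorizer = W2VSWEMVectorizer("20180402/jawiki.entity_vectors.100d")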
- Parameters:
model_name_or_path (str, optional) – The model name or path to Word2Vec embeddings. Default is "20190520/jawiki.word_vectors.100d". The alias names are the "name" keys in resources/model_card/gensim/**/*.json.
pooling_method (str, optional) – The pooling method for aggregating word vectors (“mean” or “max”). Default is “mean”.
download_timeout (int, optional) – The timeout for downloading embeddings. Default is 3600.
tokenizer (BaseTokenizer, optional) – The tokenizer to use. If not provided, a default MecabTokenizer is used.
- w2v_dir
The directory for storing Word2Vec embeddings.
- Type:
Path
Examples
>>> from sumire.vectorizer.swem import W2VSWEMVectorizer
>>> vectorizer = W2VSWEMVectorizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> vectors = vectorizer.transform(texts)
>>> vectors.shape
(2, 100)
- fit(texts: str | List[str], *args, **kwargs) None
Fit the vectorizer (not implemented).
- Parameters:
texts (str or List[str]) – Input texts for fitting the vectorizer.
- Returns:
None
- classmethod from_pretrained(path: str | Path)
Load a pretrained vectorizer from a specified path.
- Parameters:
path (str or Path) – The directory path to load the pretrained vectorizer from.
- Returns:
The loaded pretrained vectorizer.
- Return type:
W2VSWEMVectorizer
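Example
A minimal loading sketch; the path below is a placeholder for a directory previously written by save_pretrained.
>>> vectorizer = W2VSWEMVectorizer.from_pretrained("/path/to/saved_vectorizer")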
- get_token_vectors(texts: str | List[str]) List[List[Tuple[str, array]]]
Tokenizes each input text and returns a list of (token, token_vector) tuples for each text.
- Parameters:
texts (TokenizerInputs) – The input text or a list of texts to tokenize.
- Returns:
- Each internal list consists of tuples of tokenized words and their vector representations.
- Return type:
EncodeTokensOutputs
Example
>>> from sumire.vectorizer.swem import W2VSWEMVectorizer
>>> vectorizer = W2VSWEMVectorizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> vectors = vectorizer.get_token_vectors(texts)
>>> len(vectors)
2
>>> isinstance(vectors[0][0][0], str)
True
>>> vectors[0][0][1].shape == (100, )
True
- save_pretrained(path: str | Path) None
Save the vectorizer and tokenizer to a specified path.
- Parameters:
path (str or Path) – The directory path to save the vectorizer.
- Returns:
None
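Example
A minimal round-trip sketch; the directory below is a placeholder.
>>> vectorizer = W2VSWEMVectorizer()
>>> vectorizer.save_pretrained("/tmp/w2v_swem_vectorizer")
>>> restored = W2VSWEMVectorizer.from_pretrained("/tmp/w2v_swem_vectorizer")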
- setup_w2v_if_not_installed(path_or_url: str) Path
Set up Word2Vec embeddings if they are not already installed.
- Parameters:
path_or_url (str) – The path or URL to Word2Vec embeddings.
- Returns:
The path to the Word2Vec binary file.
- Return type:
Path
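Example
A minimal sketch, assuming the default jawiki alias; on first use this downloads and caches the embeddings under w2v_dir.
>>> from pathlib import Path
>>> vectorizer = W2VSWEMVectorizer()
>>> bin_path = vectorizer.setup_w2v_if_not_installed("20190520/jawiki.word_vectors.100d")
>>> isinstance(bin_path, Path)
True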
- transform(texts: str | List[str], *args, **kwargs) array
Transform input texts into word vectors.
- Parameters:
texts (str or List[str]) – Input texts to be transformed.
- Returns:
Transformed text vectors.
- Return type:
np.array
- w2v_dir = PosixPath('/home/runner/.local/sumire/gensim')
sumire.vectorizer.tfidf module
- class sumire.vectorizer.tfidf.TfidfVectorizer(tokenizer: str | TokenizerType = 'mecab', lowercase: bool = True, stop_words: List[str] | None = None, ngram_range: Tuple = (1, 1), max_df: float = 1.0, min_df: int = 1, max_features: int | None = None, norm: str = 'l2', use_idf: bool = True, smooth_idf: bool = True, sublinear_tf: bool = False, *args, **kwargs)
Bases:
SkLearnVectorizerBase
TfidfVectorizer is a vectorizer class that uses the TfidfVectorizer module implemented in scikit-learn. This module wraps Japanese text tokenization before the input reaches scikit-learn's TfidfVectorizer.
- Parameters:
tokenizer (Union[str, TokenizerType]) – The tokenizer to use for tokenization.
lowercase (bool, optional) – Whether to convert all characters to lowercase before tokenization. Defaults to True.
stop_words (str, List[str], or None, optional) – The stop words to use for filtering tokens. Defaults to None.
ngram_range (tuple, optional) – The range of n-grams to extract as features. Defaults to (1, 1) (i.e., only unigrams).
max_df (float or int, optional) – The maximum document frequency for a token to be included in the vocabulary. Can be a float in the range [0.0, 1.0] or an integer. Defaults to 1.0 (i.e., no filtering).
min_df (float or int, optional) – The minimum document frequency for a token to be included in the vocabulary. Can be a float in the range [0.0, 1.0] or an integer. Defaults to 1 (i.e., no filtering).
max_features (int or None, optional) – The maximum number of features (tokens) to include in the vocabulary. Defaults to None (i.e., no limit).
norm (str, optional) – The normalization method for tf-idf vectors. Defaults to “l2”.
use_idf (bool, optional) – Whether to use inverse document frequency in tf-idf computation. Defaults to True.
smooth_idf (bool, optional) – Whether to smooth idf weights. Defaults to True.
sublinear_tf (bool, optional) – Whether to apply sublinear tf scaling. Defaults to False.
- Returns:
None
Example
>>> from sumire.tokenizer import MecabTokenizer
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokenizer = MecabTokenizer()
>>> vectorizer = TfidfVectorizer(tokenizer=tokenizer)
>>> vectorizer.fit(texts)
>>> transformed = vectorizer.transform(texts)
- classmethod from_pretrained(path: str | Path)
Loads a pretrained TfidfVectorizer from the specified path.
- Parameters:
path (Union[str, Path]) – The directory path where the pretrained TfidfVectorizer is saved.
- Returns:
A TfidfVectorizer instance loaded with the pretrained model and configuration.
- Return type:
TfidfVectorizer
Example
>>> pretrained_path = "/path/to/pretrained_model"
>>> vectorizer = TfidfVectorizer.from_pretrained(pretrained_path)
sumire.vectorizer.transformer_emb module
- class sumire.vectorizer.transformer_emb.ModelCard(*, model_name: str, description: str = '')
Bases:
BaseModel
- description: str
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'model_name': FieldInfo(annotation=str, required=True)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].
This replaces Model.__fields__ from Pydantic V1.
- model_name: str
- class sumire.vectorizer.transformer_emb.TransformerEmbeddingVectorizer(pretrained_model_name_or_path: str = 'cl-tohoku/bert-base-japanese-v3', pooling_method: Literal['cls', 'mean', 'max'] = 'cls', batch_size: int = 32, max_length: int | None = None, model: PreTrainedModel | None = None, tokenizer: PreTrainedTokenizer | None = None)
Bases:
TransformersVectorizerBase
TransformerEmbeddingVectorizer is a vectorizer class that uses transformer-based embeddings (e.g., BERT) for text data. Information about tested models is in /sumire/resources/model_card/transformers.
- Parameters:
pretrained_model_name_or_path (str, optional) – The pretrained model name or path. Default is “cl-tohoku/bert-base-japanese-v3”.
pooling_method (str, optional) – The pooling method for aggregating embeddings (“cls”, “mean”, “max”). Default is “cls”.
batch_size (int, optional) – The batch size for processing texts. Default is 32.
max_length (int, optional) – The maximum length of input sequences. If not provided, it is determined by the model’s configuration.
model (PreTrainedModel, optional) – A pretrained transformer model. If not provided, it is loaded from the specified model name or path.
tokenizer (PreTrainedTokenizer, optional) – A pretrained tokenizer. If not provided, it is loaded from the specified model name or path.
Examples
>>> from sumire.vectorizer.transformer_emb import TransformerEmbeddingVectorizer
>>> vectorizer = TransformerEmbeddingVectorizer()
>>> texts = ["This is a sample sentence.", "Another example."]
>>> vectors = vectorizer.transform(texts)
>>> vectors.shape
(2, 768)  # Assuming a BERT model with 768-dimensional embeddings
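A variant sketch showing the non-default pooling and batching options described above; the model name is the documented default, and other models may produce different embedding dimensions.
>>> vectorizer = TransformerEmbeddingVectorizer(
...     pretrained_model_name_or_path="cl-tohoku/bert-base-japanese-v3",
...     pooling_method="mean",
...     batch_size=16)
>>> vectors = vectorizer.transform(["これはテスト文です。"])
>>> vectors.shape
(1, 768)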
- classmethod from_pretrained(path: str | Path)
Load a pretrained TransformerEmbeddingVectorizer from a specified path.
- Parameters:
path (str or Path) – The directory path to load the pretrained vectorizer from.
- Returns:
The loaded pretrained vectorizer.
- Return type:
TransformerEmbeddingVectorizer
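Example
A minimal loading sketch; the path below is a placeholder for a directory containing a previously saved vectorizer.
>>> vectorizer = TransformerEmbeddingVectorizer.from_pretrained("/path/to/saved_vectorizer")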
- get_token_vectors(texts: str | List[str]) List[List[Tuple[str, array]]]
Tokenizes each input text and returns a list of (token, token_vector) tuples for each text.
- Parameters:
texts (TokenizerInputs) – The input text or a list of texts to tokenize.
- Returns:
- Each internal list consists of tuples of tokenized words and their vector representations.
- Return type:
EncodeTokensOutputs
Examples
>>> import numpy as np
>>> from sumire.vectorizer.transformer_emb import TransformerEmbeddingVectorizer
>>> vectorizer = TransformerEmbeddingVectorizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> vectors = vectorizer.get_token_vectors(texts)
>>> len(vectors)
2
>>> isinstance(vectors[0][0][0], str)
True
>>> vectors[0][0][1].shape == (768, )
True
- transform(texts: str | List[str], *args, **kwargs) array
Transform input texts into transformer-based embeddings.
- Parameters:
texts (str or List[str]) – Input texts to be transformed.
batch_size (int, optional) – The batch size for processing texts. Default is 32.
- Returns:
Transformed embeddings.
- Return type:
np.array
Examples
>>> from sumire.vectorizer.transformer_emb import TransformerEmbeddingVectorizer
>>> vectorizer = TransformerEmbeddingVectorizer()
>>> texts = ["This is a sample sentence.", "Another example."]
>>> vectors = vectorizer.transform(texts)
>>> vectors.shape
(2, 768)  # Assuming a BERT model with 768-dimensional embeddings