sumire.vectorizer package
Submodules
sumire.vectorizer.count module
- class sumire.vectorizer.count.CountVectorizer(tokenizer: str | TokenizerType = 'mecab', lowercase: bool = True, stop_words: List[str] | None = None, ngram_range: Tuple = (1, 1), max_df: float = 1.0, min_df: int = 1, max_features: int | None = None, *args, **kwargs)
- Bases: SkLearnVectorizerBase
- CountVectorizer is a vectorizer class that uses the CountVectorizer implemented in scikit-learn. This module wraps Japanese text tokenization before passing the inputs to the scikit-learn CountVectorizer.
- Parameters:
- tokenizer (Union[str, TokenizerType]) – The tokenizer to use for tokenization. It can be a tokenizer instance, the name of a pretrained tokenizer, or the name of a built-in tokenizer. 
- lowercase (bool, optional) – Whether to convert all characters to lowercase before tokenization. Defaults to True. 
- stop_words (str, List[str], or None, optional) – The stop words to use for filtering tokens. Defaults to None. 
- ngram_range (tuple, optional) – The range of n-grams to extract as features. Defaults to (1, 1) (i.e., only unigrams). 
- max_df (float or int, optional) – The maximum document frequency for a token to be included in the vocabulary. Can be a float in the range [0.0, 1.0] or an integer. Defaults to 1.0 (i.e., no filtering). 
- min_df (float or int, optional) – The minimum document frequency for a token to be included in the vocabulary. Can be a float in the range [0.0, 1.0] or an integer. Defaults to 1 (i.e., no filtering). 
- max_features (int or None, optional) – The maximum number of features (tokens) to include in the vocabulary. Defaults to None (i.e., no limit). 
 
- Returns:
- None 
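Beyond the basic Example that follows, the vocabulary-shaping parameters above can be combined at construction time; a hedged sketch (parameter values are illustrative, not recommendations):

>>> from sumire.vectorizer.count import CountVectorizer
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> # unigrams and bigrams, capped vocabulary; tokenizer given by built-in name
>>> vectorizer = CountVectorizer(tokenizer="mecab", ngram_range=(1, 2), max_features=5000)
>>> vectorizer.fit(texts)
>>> transformed = vectorizer.transform(texts)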
- Example

>>> from sumire.tokenizer import MecabTokenizer
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokenizer = MecabTokenizer()
>>> vectorizer = CountVectorizer(tokenizer=tokenizer)
>>> vectorizer.fit(texts)
>>> transformed = vectorizer.transform(texts)

- classmethod from_pretrained(path: str | Path)
- Loads a pretrained CountVectorizer from the specified path.
- Parameters:
- path (Union[str, Path]) – The directory path where the pretrained CountVectorizer is saved. 
- Returns:
- A CountVectorizer instance loaded with the pretrained model and configuration. 
- Return type:
- CountVectorizer
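A usage sketch, mirroring the analogous TfidfVectorizer.from_pretrained example later on this page (the path is illustrative):

>>> pretrained_path = "/path/to/pretrained_model"  # illustrative directory
>>> vectorizer = CountVectorizer.from_pretrained(pretrained_path)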
 
sumire.vectorizer.swem module
- class sumire.vectorizer.swem.ModelCard(*, name: str, url: str, tokenizer_name: str = 'mecab', description: str = '')
- Bases: BaseModel
- description: str
 - model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
- A dictionary of computed field names and their corresponding ComputedFieldInfo objects. 
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict]. 
 - model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'name': FieldInfo(annotation=str, required=True), 'tokenizer_name': FieldInfo(annotation=str, required=False, default='mecab'), 'url': FieldInfo(annotation=str, required=True)}
- Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo]. - This replaces Model.__fields__ from Pydantic V1. 
 - name: str
 - tokenizer_name: str
 - url: str
 
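A minimal construction sketch based on the fields listed above (the URL is illustrative, not a real model location):

>>> from sumire.vectorizer.swem import ModelCard
>>> card = ModelCard(name="chive-1.0-mc5", url="https://example.com/vectors.bin")  # url is illustrative
>>> card.tokenizer_name  # falls back to the 'mecab' default
'mecab'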
- class sumire.vectorizer.swem.W2VSWEMVectorizer(model_name_or_path: str = '20190520/jawiki.word_vectors.100d', pooling_method: Literal['mean', 'max'] = 'mean', download_timeout: int = 3600, tokenizer: BaseTokenizer | None = None)
- Bases: BaseVectorizer
- W2VSWEMVectorizer is a vectorizer class that uses Word2Vec-based embeddings for text data.
- To use a chive model, set model_name_or_path to “chive-{version}-mc{min count}”, e.g. W2VSWEMVectorizer(“chive-1.0-mc5”) (a short sketch follows the parameter list). The model alias names are the “name” keys in “sumire/resource/model_card/gensim/chive/*.json”.
- To use cl-tohoku Japanese Wikipedia entity vectors, set model_name_or_path to “{release date}/jawiki.{all|entity|word}_vectors.{dimension}d”, e.g. W2VSWEMVectorizer(“20180402/jawiki.entity_vectors.100d”). The model alias names are the “name” keys of “sumire/resource/model_card/gensim/cl-tohoku_jawiki_vector/**/*.json”.
- Parameters:
- model_name_or_path (str, optional) – The model name or path to Word2Vec embeddings. Default is “20190520/jawiki.word_vectors.100d”. The alias names are in the “name” key of resources/model_card/gensim/**/*.json.
- pooling_method (str, optional) – The pooling method for aggregating word vectors (“mean” or “max”). Default is “mean”. 
- download_timeout (int, optional) – The timeout for downloading embeddings. Default is 3600. 
- tokenizer (BaseTokenizer, optional) – The tokenizer to use. If not provided, a default MecabTokenizer is used. 
 
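A hedged sketch of selecting a model by alias, per the naming rules above (assumes the alias resolves and the embeddings can be downloaded, which may take a while):

>>> from sumire.vectorizer.swem import W2VSWEMVectorizer
>>> # chive alias from sumire/resource/model_card/gensim/chive/*.json
>>> vectorizer = W2VSWEMVectorizer("chive-1.0-mc5", pooling_method="max")
>>> vectors = vectorizer.transform(["これはテスト文です。"])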
 - w2v_dir
- The directory for storing Word2Vec embeddings.
- Type:
- Path 
 
- Examples

>>> from sumire.vectorizer.swem import W2VSWEMVectorizer
>>> vectorizer = W2VSWEMVectorizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> vectors = vectorizer.transform(texts)
>>> vectors.shape
(2, 100)

- fit(texts: str | List[str], *args, **kwargs) → None
- Fit the vectorizer (not implemented).
- Parameters:
- texts (str or List[str]) – Input texts for fitting the vectorizer. 
- Returns:
- None 
 
 - classmethod from_pretrained(path: str | Path)
- Load a pretrained vectorizer from a specified path.
- Parameters:
- path (str or Path) – The directory path to load the pretrained vectorizer from. 
- Returns:
- The loaded pretrained vectorizer. 
- Return type:
- W2VSWEMVectorizer
- get_token_vectors(texts: str | List[str]) → List[List[Tuple[str, array]]]
- Tokenizes each input text and returns, for each text, a list of (token, token_vector) tuples.
- Parameters:
- texts (TokenizerInputs) – The input text or a list of texts to tokenize. 
- Returns:
- Each inner list consists of tuples of tokenized words and their vector representations.
 
- Return type:
- EncodeTokensOutputs 
- Example

>>> from sumire.vectorizer.swem import W2VSWEMVectorizer
>>> vectorizer = W2VSWEMVectorizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> vectors = vectorizer.get_token_vectors(texts)
>>> len(vectors)
2
>>> isinstance(vectors[0][0][0], str)
True
>>> vectors[0][0][1].shape == (100, )
True
- save_pretrained(path: str | Path) → None
- Save the vectorizer and tokenizer to a specified path.
- Parameters:
- path (str or Path) – The directory path to save the vectorizer. 
- Returns:
- None 
 
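A hedged round-trip sketch pairing save_pretrained with from_pretrained (the directory path is illustrative):

>>> from sumire.vectorizer.swem import W2VSWEMVectorizer
>>> vectorizer = W2VSWEMVectorizer()
>>> vectorizer.save_pretrained("/path/to/saved_vectorizer")  # illustrative directory
>>> restored = W2VSWEMVectorizer.from_pretrained("/path/to/saved_vectorizer")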
- setup_w2v_if_not_installed(path_or_url: str) → Path
- Set up Word2Vec embeddings if they are not already installed.
- Parameters:
- path_or_url (str) – The path or URL to Word2Vec embeddings. 
- Returns:
- The path to the Word2Vec binary file. 
- Return type:
- Path 
 
- transform(texts: str | List[str], *args, **kwargs) → array
- Transform input texts into word vectors.
- Parameters:
- texts (str or List[str]) – Input texts to be transformed. 
- Returns:
- Transformed text vectors. 
- Return type:
- np.array 
 
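As a supplementary sketch, switching pooling_method changes how word vectors are aggregated into a text vector; the output dimensionality stays that of the underlying embeddings (100 for the default model):

>>> from sumire.vectorizer.swem import W2VSWEMVectorizer
>>> vectorizer = W2VSWEMVectorizer(pooling_method="max")  # element-wise max instead of mean
>>> vectors = vectorizer.transform(["これはテスト文です。"])
>>> vectors.shape
(1, 100)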
 - w2v_dir = PosixPath('/home/runner/.local/sumire/gensim')
 
sumire.vectorizer.tfidf module
- class sumire.vectorizer.tfidf.TfidfVectorizer(tokenizer: str | TokenizerType = 'mecab', lowercase: bool = True, stop_words: List[str] | None = None, ngram_range: Tuple = (1, 1), max_df: float = 1.0, min_df: int = 1, max_features: int | None = None, norm: str = 'l2', use_idf: bool = True, smooth_idf: bool = True, sublinear_tf: bool = False, *args, **kwargs)
- Bases: SkLearnVectorizerBase
- TfidfVectorizer is a vectorizer class that uses the TfidfVectorizer implemented in scikit-learn. This module wraps Japanese text tokenization before passing the inputs to the scikit-learn TfidfVectorizer.
- Parameters:
- tokenizer (Union[str, TokenizerType]) – The tokenizer to use for tokenization. It can be a tokenizer instance, the name of a pretrained tokenizer, or the name of a built-in tokenizer. 
- lowercase (bool, optional) – Whether to convert all characters to lowercase before tokenization. Defaults to True. 
- stop_words (str, List[str], or None, optional) – The stop words to use for filtering tokens. Defaults to None. 
- ngram_range (tuple, optional) – The range of n-grams to extract as features. Defaults to (1, 1) (i.e., only unigrams). 
- max_df (float or int, optional) – The maximum document frequency for a token to be included in the vocabulary. Can be a float in the range [0.0, 1.0] or an integer. Defaults to 1.0 (i.e., no filtering). 
- min_df (float or int, optional) – The minimum document frequency for a token to be included in the vocabulary. Can be a float in the range [0.0, 1.0] or an integer. Defaults to 1 (i.e., no filtering). 
- max_features (int or None, optional) – The maximum number of features (tokens) to include in the vocabulary. Defaults to None (i.e., no limit). 
- norm (str, optional) – The normalization method for tf-idf vectors. Defaults to “l2”. 
- use_idf (bool, optional) – Whether to use inverse document frequency in tf-idf computation. Defaults to True. 
- smooth_idf (bool, optional) – Whether to smooth idf weights. Defaults to True. 
- sublinear_tf (bool, optional) – Whether to apply sublinear tf scaling. Defaults to False. 
 
- Returns:
- None 
- Example

>>> from sumire.tokenizer import MecabTokenizer
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> tokenizer = MecabTokenizer()
>>> vectorizer = TfidfVectorizer(tokenizer=tokenizer)
>>> vectorizer.fit(texts)
>>> transformed = vectorizer.transform(texts)

- classmethod from_pretrained(path: str | Path)
- Loads a pretrained TfidfVectorizer from the specified path.
- Parameters:
- path (Union[str, Path]) – The directory path where the pretrained TfidfVectorizer is saved. 
- Returns:
- A TfidfVectorizer instance loaded with the pretrained model and configuration. 
- Return type:
- TfidfVectorizer

- Example

>>> pretrained_path = "/path/to/pretrained_model"
>>> vectorizer = TfidfVectorizer.from_pretrained(pretrained_path)
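As a supplementary sketch, the n-gram and vocabulary-size options documented above can be combined at construction time (parameter values are illustrative):

>>> from sumire.vectorizer.tfidf import TfidfVectorizer
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> # bigrams plus unigrams, capped vocabulary, sublinear tf scaling
>>> vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=1000, sublinear_tf=True)
>>> vectorizer.fit(texts)
>>> transformed = vectorizer.transform(texts)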
 
sumire.vectorizer.transformer_emb module
- class sumire.vectorizer.transformer_emb.ModelCard(*, model_name: str, description: str = '')
- Bases: BaseModel
- description: str
 - model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}
- A dictionary of computed field names and their corresponding ComputedFieldInfo objects. 
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict]. 
 - model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'model_name': FieldInfo(annotation=str, required=True)}
- Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo]. - This replaces Model.__fields__ from Pydantic V1. 
 - model_name: str
 
- class sumire.vectorizer.transformer_emb.TransformerEmbeddingVectorizer(pretrained_model_name_or_path: str = 'cl-tohoku/bert-base-japanese-v3', pooling_method: Literal['cls', 'mean', 'max'] = 'cls', batch_size: int = 32, max_length: int | None = None, model: PreTrainedModel | None = None, tokenizer: PreTrainedTokenizer | None = None)
- Bases: TransformersVectorizerBase
- TransformerEmbeddingVectorizer is a vectorizer class that uses transformer-based embeddings (e.g., BERT) for text data.
- Information on tested models is in /sumire/resources/model_card/transformers.
- Parameters:
- pretrained_model_name_or_path (str, optional) – The pretrained model name or path. Default is “cl-tohoku/bert-base-japanese-v3”. 
- pooling_method (str, optional) – The pooling method for aggregating embeddings (“cls”, “mean”, “max”). Default is “cls”. 
- batch_size (int, optional) – The batch size for processing texts. Default is 32. 
- max_length (int, optional) – The maximum length of input sequences. If not provided, it is determined by the model’s configuration. 
- model (PreTrainedModel, optional) – A pretrained transformer model. If not provided, it is loaded from the specified model name or path. 
- tokenizer (PreTrainedTokenizer, optional) – A pretrained tokenizer. If not provided, it is loaded from the specified model name or path. 
 
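A hedged configuration sketch (mean pooling over token embeddings; the max_length value is illustrative, and the 768-dimensional output assumes the default BERT model):

>>> from sumire.vectorizer.transformer_emb import TransformerEmbeddingVectorizer
>>> vectorizer = TransformerEmbeddingVectorizer(pooling_method="mean", max_length=128)
>>> vectors = vectorizer.transform(["これはテスト文です。"])
>>> vectors.shape
(1, 768)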
- Examples

>>> from sumire.vectorizer.transformer_emb import TransformerEmbeddingVectorizer
>>> vectorizer = TransformerEmbeddingVectorizer()
>>> texts = ["This is a sample sentence.", "Another example."]
>>> vectors = vectorizer.transform(texts)
>>> vectors.shape
(2, 768)  # Assuming a BERT model with 768-dimensional embeddings

- classmethod from_pretrained(path: str | Path)
- Load a pretrained TransformerEmbeddingVectorizer from a specified path.
- Parameters:
- path (str or Path) – The directory path to load the pretrained vectorizer from. 
- Returns:
- The loaded pretrained vectorizer. 
- Return type:
- TransformerEmbeddingVectorizer
- get_token_vectors(texts: str | List[str]) → List[List[Tuple[str, array]]]
- Tokenizes each input text and returns, for each text, a list of (token, token_vector) tuples.
- Parameters:
- texts (TokenizerInputs) – The input text or a list of texts to tokenize. 
- Returns:
- Each inner list consists of tuples of tokenized words and their vector representations.
 
- Return type:
- EncodeTokensOutputs 
- Examples

>>> import numpy as np
>>> from sumire.vectorizer.transformer_emb import TransformerEmbeddingVectorizer
>>> vectorizer = TransformerEmbeddingVectorizer()
>>> texts = ["これはテスト文です。", "別のテキストもトークン化します。"]
>>> vectors = vectorizer.get_token_vectors(texts)
>>> len(vectors)
2
>>> isinstance(vectors[0][0][0], str)
True
>>> vectors[0][0][1].shape == (768, )
True
- transform(texts: str | List[str], *args, **kwargs) → array
- Transform input texts into transformer-based embeddings.
- Parameters:
- texts (str or List[str]) – Input texts to be transformed. 
- batch_size (int, optional) – The batch size for processing texts. Default is 32. 
 
- Returns:
- Transformed embeddings. 
- Return type:
- np.array 
- Examples

>>> from sumire.vectorizer.transformer_emb import TransformerEmbeddingVectorizer
>>> vectorizer = TransformerEmbeddingVectorizer()
>>> texts = ["This is a sample sentence.", "Another example."]
>>> vectors = vectorizer.transform(texts)
>>> vectors.shape
(2, 768)  # Assuming a BERT model with 768-dimensional embeddings
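A supplementary sketch of tuning throughput with batch_size, which both the constructor and transform document (the value 8 is illustrative; smaller batches trade speed for memory):

>>> from sumire.vectorizer.transformer_emb import TransformerEmbeddingVectorizer
>>> vectorizer = TransformerEmbeddingVectorizer(batch_size=8)
>>> vectors = vectorizer.transform(texts)  # texts as in the Examples above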