sumire.vectorizer.base package
Submodules
sumire.vectorizer.base.common module
- class sumire.vectorizer.base.common.BaseVectorizer
Bases:
ABC
- abstract fit(texts: str | List[str], *args, **kwargs) None
Fits the vectorizer to the input texts.
- Parameters:
texts (Union[str, List[str]]) – The input text or a list of texts to fit the vectorizer on.
- Returns:
None
- fit_transform(texts: str | List[str], *args, **kwargs) array
Fits the vectorizer to the input texts, then transforms those texts into numerical vectors.
- Parameters:
texts (Union[str, List[str]]) – The input text or a list of texts to fit the vectorizer on and then transform.
- Returns:
An array of numerical vectors representing the input texts.
- Return type:
np.array
- abstract classmethod from_pretrained(path: str | Path)
Loads a pretrained vectorizer from the specified directory.
- Parameters:
path (str or Path) – The directory path where the pretrained vectorizer is saved.
- Returns:
An instance of the pretrained vectorizer.
- Return type:
- Raises:
NotImplementedError – This method must be implemented in derived classes.
- abstract save_pretrained(path: str | Path) None
Saves the pretrained vectorizer to the specified directory.
- Parameters:
path (str or Path) – The directory path where the pretrained vectorizer will be saved.
- Returns:
None
- Raises:
NotImplementedError – This method must be implemented in derived classes.
- tokenize(inputs: str | List[str]) List[List[str]]
Tokenizes the input text or list of texts.
- Parameters:
inputs (TokenizerInputs) – The input text or a list of texts to tokenize.
- Returns:
A list of lists, where each inner list contains the tokens of a single input text.
- Return type:
TokenizerOutputs
- Raises:
NotImplementedError – This method must be implemented in derived classes.
- abstract transform(texts: str | List[str], *args, **kwargs) array
Transforms the input texts into numerical vectors.
- Parameters:
texts (Union[str, List[str]]) – The input texts or a list of texts to transform.
- Returns:
An array of numerical vectors representing the input texts.
- Return type:
np.array
- Raises:
NotImplementedError – This method must be implemented in derived classes.
- vectorizer_config_file = 'vectorizer_config.json'
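The interface above can be illustrated with a minimal sketch. It does not import sumire: `VectorizerInterface` is a hypothetical stand-in that mirrors the documented method names, `ToyCountVectorizer` and `round_trip` are invented for illustration, and plain Python lists stand in for `np.array`. Storing the vocabulary as JSON under `vectorizer_config.json` is likewise an assumption about what a subclass might persist, not a description of sumire's actual file layout.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List, Optional, Union

Texts = Union[str, List[str]]


class VectorizerInterface(ABC):
    """Hypothetical stand-in mirroring the documented BaseVectorizer methods."""

    vectorizer_config_file = "vectorizer_config.json"

    @abstractmethod
    def fit(self, texts: Texts) -> None: ...

    @abstractmethod
    def transform(self, texts: Texts) -> List[List[int]]: ...

    @abstractmethod
    def save_pretrained(self, path: Union[str, Path]) -> None: ...

    @classmethod
    @abstractmethod
    def from_pretrained(cls, path: Union[str, Path]) -> "VectorizerInterface": ...

    def fit_transform(self, texts: Texts) -> List[List[int]]:
        # Fit first, then transform the same texts.
        self.fit(texts)
        return self.transform(texts)


class ToyCountVectorizer(VectorizerInterface):
    """Whitespace-token count vectorizer, for illustration only."""

    def __init__(self, vocab: Optional[Dict[str, int]] = None):
        self.vocab = dict(vocab or {})

    @staticmethod
    def _as_list(texts: Texts) -> List[str]:
        # Accept a single string or a list of strings, as the interface does.
        return [texts] if isinstance(texts, str) else list(texts)

    def fit(self, texts: Texts) -> None:
        for text in self._as_list(texts):
            for token in text.split():
                self.vocab.setdefault(token, len(self.vocab))

    def transform(self, texts: Texts) -> List[List[int]]:
        rows = []
        for text in self._as_list(texts):
            row = [0] * len(self.vocab)
            for token in text.split():
                if token in self.vocab:
                    row[self.vocab[token]] += 1
            rows.append(row)
        return rows

    def save_pretrained(self, path: Union[str, Path]) -> None:
        directory = Path(path)
        directory.mkdir(parents=True, exist_ok=True)
        with open(directory / self.vectorizer_config_file, "w", encoding="utf-8") as f:
            json.dump({"vocab": self.vocab}, f)

    @classmethod
    def from_pretrained(cls, path: Union[str, Path]) -> "ToyCountVectorizer":
        with open(Path(path) / cls.vectorizer_config_file, encoding="utf-8") as f:
            return cls(**json.load(f))


def round_trip(texts: Texts):
    """Fit, save to a temporary directory, reload, and transform again."""
    vectorizer = ToyCountVectorizer()
    vectors = vectorizer.fit_transform(texts)
    with tempfile.TemporaryDirectory() as tmp:
        vectorizer.save_pretrained(tmp)
        reloaded = ToyCountVectorizer.from_pretrained(tmp)
    return vectors, reloaded.transform(texts)
```

In this sketch a reloaded vectorizer produces the same vectors as the original, which is the property a `save_pretrained` / `from_pretrained` pair is expected to preserve.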
- class sumire.vectorizer.base.common.GetTokenVectorsMixIn
Bases:
object
- abstract get_token_vectors(texts: str | List[str]) List[List[Tuple[str, array]]]
Tokenizes each input text and returns, for each one, a list of (token, token_vector) tuples.
- Parameters:
texts (TokenizerInputs) – The input text or a list of texts to tokenize.
- Returns:
A list with one inner list per input text. Each inner list consists of (token, token_vector) tuples that pair a token with its vector representation.
- Return type:
EncodeTokensOutputs
- Raises:
NotImplementedError – This method must be implemented in derived classes.
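To make the nested return shape concrete, here is a small sketch of a function with the same signature. The lookup table `TOY_EMBEDDINGS`, the `UNKNOWN` fallback, and whitespace tokenization are all assumptions made for illustration; a real implementation would obtain tokens and vectors from its own tokenizer and model, and would return arrays rather than lists.

```python
from typing import Dict, List, Tuple, Union

Vector = List[float]

# Hypothetical lookup table standing in for a trained model.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "hello": [1.0, 0.0],
    "world": [0.0, 1.0],
}
UNKNOWN: Vector = [0.0, 0.0]  # assumed fallback for out-of-vocabulary tokens


def get_token_vectors(texts: Union[str, List[str]]) -> List[List[Tuple[str, Vector]]]:
    # A single string is treated as a one-element list of texts.
    if isinstance(texts, str):
        texts = [texts]
    # Outer list: one entry per input text.
    # Inner list: one (token, token_vector) tuple per token.
    return [
        [(token, TOY_EMBEDDINGS.get(token, UNKNOWN)) for token in text.split()]
        for text in texts
    ]
```

The outer list always has one entry per input text, even when a single string is passed.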
sumire.vectorizer.base.sklearn_vectorizer_base module
- class sumire.vectorizer.base.sklearn_vectorizer_base.SkLearnVectorizerBase
Bases:
BaseVectorizer, ABC
- decode(data: array) List[List[str]]
Decodes the numerical vectors into tokenized texts.
- Parameters:
data (np.array) – The numerical vectors to decode.
- Returns:
The decoded tokenized texts.
- Return type:
List[List[str]]
- fit(texts: str | List[str], *args, **kwargs) None
Fits the vectorizer to the input texts.
- Parameters:
texts (TokenizerInputs) – The input text or a list of texts to fit the vectorizer on.
- Returns:
None
- classmethod from_pretrained(path: str | Path)
Loads a pretrained vectorizer from the specified directory.
- Parameters:
path (str or Path) – The directory path where the pretrained vectorizer is saved.
- Returns:
An instance of the pretrained vectorizer.
- Return type:
- Raises:
NotImplementedError – This method must be implemented in derived classes.
- save_pretrained(path: str | Path) None
Saves the pretrained vectorizer and tokenizer to the specified path.
- Parameters:
path (str or Path) – The directory path where the pretrained vectorizer and tokenizer will be saved.
- Returns:
None
- transform(texts: str | List[str], *args, **kwargs) array
Transforms the input texts into numerical vectors.
- Parameters:
texts (TokenizerInputs) – The input texts to transform.
- Returns:
An array of numerical vectors representing the input texts.
- Return type:
np.array
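The `decode` method above maps vectors back to tokens. One plausible reading, sketched below, is that each row yields the vocabulary entries whose features are non-zero, similar in spirit to scikit-learn's `CountVectorizer.inverse_transform`. Whether sumire delegates to that method is an assumption; the explicit `vocabulary` argument is hypothetical and is used here only to keep the sketch self-contained.

```python
from typing import List, Sequence


def decode(data: Sequence[Sequence[float]], vocabulary: Sequence[str]) -> List[List[str]]:
    """Return, for each row, the vocabulary tokens whose feature value is non-zero."""
    return [
        [vocabulary[index] for index, value in enumerate(row) if value != 0]
        for row in data
    ]
```

Note that a decoding of this kind recovers which tokens occurred, but not their original order or, for count features, how the text was laid out.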
sumire.vectorizer.base.transformer_vectorizer_base module
- class sumire.vectorizer.base.transformer_vectorizer_base.TransformersVectorizerBase
Bases:
BaseVectorizer, GetTokenVectorsMixIn, ABC
- fit(texts: str | List[str], *args, **kwargs) None
This method is not implemented for TransformersVectorizerBase.
- Parameters:
texts (Union[str, List[str]]) – The input texts for fitting the vectorizer.
- Returns:
None
- classmethod load_init_args(path: str)
Loads initialization arguments for the vectorizer from the specified path.
- Parameters:
path (str) – The directory path where the initialization arguments are stored.
- Returns:
A dictionary containing the initialization arguments.
- Return type:
dict
- Raises:
ValueError – If the vectorizer configuration file does not exist at the specified path.
- save_pretrained(path: str | Path) None
Saves the pretrained vectorizer, tokenizer, and model to the specified path.
- Parameters:
path (str or Path) – The directory path where the pretrained vectorizer, tokenizer, and model will be saved.
- Returns:
None
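The behaviour documented for `load_init_args` (read the initialization arguments from the given directory and raise `ValueError` when the configuration file is absent) can be sketched as follows. The file name comes from `BaseVectorizer.vectorizer_config_file`; that the file is JSON with the arguments at the top level is an assumption, and this standalone function is not the library's implementation.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union

# File name taken from BaseVectorizer.vectorizer_config_file.
VECTORIZER_CONFIG_FILE = "vectorizer_config.json"


def load_init_args(path: Union[str, Path]) -> Dict[str, Any]:
    config_path = Path(path) / VECTORIZER_CONFIG_FILE
    if not config_path.exists():
        # Mirrors the documented ValueError for a missing configuration file.
        raise ValueError(f"Vectorizer config not found: {config_path}")
    # Assumed format: a JSON object mapping argument names to values.
    with open(config_path, encoding="utf-8") as f:
        return json.load(f)
```

A caller could then pass the returned dictionary as keyword arguments to the vectorizer's constructor, if the constructor accepts those names.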