sumire.vectorizer.base package

Submodules

sumire.vectorizer.base.common module

class sumire.vectorizer.base.common.BaseVectorizer

Bases: ABC

abstract fit(texts: str | List[str], *args, **kwargs) None

Fits the vectorizer to the input texts.

Parameters:

texts (Union[str, List[str]]) – The input text or a list of texts to fit the vectorizer on.

Returns:

None

fit_transform(texts: str | List[str], *args, **kwargs) array

Fits the vectorizer to the input texts, then transforms them into numerical vectors.

Parameters:

texts (Union[str, List[str]]) – The input text or a list of texts to fit the vectorizer on and then transform.

Returns:

An array of numerical vectors representing the input texts.

Return type:

np.array
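
The fit / transform / fit_transform contract described above can be sketched with a stand-in class. This is a minimal illustration, not sumire's implementation: `ToyVectorizerBase` and `CharCountVectorizer` are hypothetical names, plain Python lists stand in for `np.array` so the example has no third-party dependencies, and it assumes that `fit_transform` simply calls `fit` and then `transform` on the same texts.

```python
from abc import ABC, abstractmethod
from typing import List, Union


class ToyVectorizerBase(ABC):
    """Hypothetical stand-in mirroring the fit/transform interface."""

    @abstractmethod
    def fit(self, texts: Union[str, List[str]]) -> None:
        ...

    @abstractmethod
    def transform(self, texts: Union[str, List[str]]) -> List[List[int]]:
        ...

    def fit_transform(self, texts: Union[str, List[str]]) -> List[List[int]]:
        # Assumption: fit on the texts, then transform the same texts.
        self.fit(texts)
        return self.transform(texts)


class CharCountVectorizer(ToyVectorizerBase):
    """Toy subclass: one dimension per character seen during fit."""

    def __init__(self) -> None:
        self.vocab: List[str] = []

    def fit(self, texts: Union[str, List[str]]) -> None:
        # A single string is treated as a batch of one text.
        texts = [texts] if isinstance(texts, str) else texts
        self.vocab = sorted({ch for text in texts for ch in text})

    def transform(self, texts: Union[str, List[str]]) -> List[List[int]]:
        texts = [texts] if isinstance(texts, str) else texts
        # Count how often each vocabulary character occurs in each text.
        return [[text.count(ch) for ch in self.vocab] for text in texts]


vectors = CharCountVectorizer().fit_transform(["aab", "bc"])
print(vectors)
```

Because `fit` and `transform` are abstract in the stand-in, only concrete subclasses such as `CharCountVectorizer` can be instantiated.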

abstract classmethod from_pretrained(path: str | Path)

Loads a pretrained vectorizer from the specified directory.

Parameters:

path (str or Path) – The directory path where the pretrained vectorizer is saved.

Returns:

An instance of the pretrained vectorizer.

Return type:

BaseVectorizer

Raises:

NotImplementedError – This method must be implemented in derived classes.

abstract save_pretrained(path: str | Path) None

Saves the pretrained vectorizer to the specified directory.

Parameters:

path (str or Path) – The directory path where the pretrained vectorizer will be saved.

Returns:

None

Raises:

NotImplementedError – This method must be implemented in derived classes.
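
A save / load round trip through `save_pretrained` and `from_pretrained` might look like the sketch below. `ToyVectorizer` and its `vocab` field are hypothetical, and storing the state as JSON is an assumption based only on the `vectorizer_config.json` file name; the real subclasses may persist additional files.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Union


class ToyVectorizer:
    """Hypothetical vectorizer that persists its state as a JSON file."""

    # Same file name as the documented `vectorizer_config_file` attribute.
    vectorizer_config_file = "vectorizer_config.json"

    def __init__(self, vocab: List[str]) -> None:
        self.vocab = vocab

    def save_pretrained(self, path: Union[str, Path]) -> None:
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)
        config = {"vocab": self.vocab}  # hypothetical config layout
        (path / self.vectorizer_config_file).write_text(
            json.dumps(config), encoding="utf-8"
        )

    @classmethod
    def from_pretrained(cls, path: Union[str, Path]) -> "ToyVectorizer":
        config_path = Path(path) / cls.vectorizer_config_file
        config = json.loads(config_path.read_text(encoding="utf-8"))
        # Rebuild the instance from the saved constructor arguments.
        return cls(**config)


with tempfile.TemporaryDirectory() as tmp:
    ToyVectorizer(["a", "b"]).save_pretrained(tmp)
    restored = ToyVectorizer.from_pretrained(tmp)

print(restored.vocab)
```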

tokenize(inputs: str | List[str]) List[List[str]]

Tokenizes the input text or list of texts.

Parameters:

inputs (TokenizerInputs) – The input text or a list of texts to tokenize.

Returns:

A list of lists, where each inner list represents tokenized words for a single input text.

Return type:

TokenizerOutputs

Raises:

NotImplementedError – This method must be implemented in derived classes.
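
The input and output shapes of `tokenize` can be illustrated with a small helper. `toy_tokenize` is a hypothetical function; whitespace splitting stands in for whatever tokenizer a concrete vectorizer uses, and wrapping a single string into a one-element batch is an assumption about how the `str | List[str]` input is normalized.

```python
from typing import List, Union


def toy_tokenize(inputs: Union[str, List[str]]) -> List[List[str]]:
    """Hypothetical helper: whitespace splitting stands in for a real tokenizer."""
    # A single string is treated as a batch containing one text.
    texts = [inputs] if isinstance(inputs, str) else inputs
    return [text.split() for text in texts]


single = toy_tokenize("hello world")
batch = toy_tokenize(["hello world", "good morning everyone"])
print(single)
print(batch)
```

Under this assumption the result is always a list of token lists, even when a single string is passed.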

abstract transform(texts: str | List[str], *args, **kwargs) array

Transforms the input texts into numerical vectors.

Parameters:

texts (Union[str, List[str]]) – The input text or a list of texts to transform.

Returns:

An array of numerical vectors representing the input texts.

Return type:

np.array

Raises:

NotImplementedError – This method must be implemented in derived classes.

vectorizer_config_file = 'vectorizer_config.json'

class sumire.vectorizer.base.common.GetTokenVectorsMixIn

Bases: object

abstract get_token_vectors(texts: str | List[str]) List[List[Tuple[str, array]]]

Tokenizes each input text and obtains a tuple list of (token, token_vector) for each input text.

Parameters:

texts (TokenizerInputs) – The input text or a list of texts to tokenize.

Returns:

A list with one inner list per input text. Each inner list consists of (token, token_vector) tuples pairing each tokenized word with its vector representation.

Return type:

EncodeTokensOutputs

Raises:

NotImplementedError – This method must be implemented in derived classes.
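
The nested return shape of `get_token_vectors` is easier to see on a toy example. Everything below is illustrative: `TOY_EMBEDDINGS`, `UNKNOWN`, and `toy_get_token_vectors` are hypothetical, whitespace splitting replaces the real tokenizer, and plain lists of floats stand in for the arrays a concrete vectorizer would return.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical 2-dimensional embeddings; lists stand in for arrays.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "red": [1.0, 0.0],
    "apple": [0.0, 1.0],
}
UNKNOWN: List[float] = [0.0, 0.0]  # vector used for tokens without an embedding


def toy_get_token_vectors(
    texts: Union[str, List[str]],
) -> List[List[Tuple[str, List[float]]]]:
    texts = [texts] if isinstance(texts, str) else texts
    # One inner list per text, one (token, vector) tuple per token.
    return [
        [(token, TOY_EMBEDDINGS.get(token, UNKNOWN)) for token in text.split()]
        for text in texts
    ]


result = toy_get_token_vectors(["red apple", "apple pie"])
for pairs in result:
    print(pairs)
```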

sumire.vectorizer.base.sklearn_vectorizer_base module

class sumire.vectorizer.base.sklearn_vectorizer_base.SkLearnVectorizerBase

Bases: BaseVectorizer, ABC

decode(data: array) List[List[str]]

Decodes the numerical vectors into tokenized texts.

Parameters:

data (np.array) – The numerical vectors to decode.

Returns:

The decoded tokenized texts.

Return type:

List[List[str]]
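
One plausible reading of `decode` is that it maps the non-zero columns of each vector back to vocabulary terms, similar in spirit to scikit-learn's `inverse_transform`. The sketch below rests on that assumption and is not the library's code: `VOCAB` and `toy_decode` are hypothetical, and nested lists stand in for `np.array`.

```python
from typing import List

# Hypothetical vocabulary: column index -> token.
VOCAB = ["apple", "banana", "cherry"]


def toy_decode(data: List[List[int]]) -> List[List[str]]:
    """Return, for each row, the tokens whose column is non-zero."""
    return [
        [VOCAB[i] for i, value in enumerate(row) if value != 0]
        for row in data
    ]


decoded = toy_decode([[1, 0, 2], [0, 1, 0]])
print(decoded)
```

With a bag-of-words style representation like this one, the original token order and repetition counts are not recovered; only the set of tokens present in each text is.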

fit(texts: str | List[str], *args, **kwargs) None

Fits the vectorizer to the input texts.

Parameters:

texts (TokenizerInputs) – The input texts to fit the vectorizer.

Returns:

None

classmethod from_pretrained(path: str | Path)

Loads a pretrained vectorizer from the specified directory.

Parameters:

path (str or Path) – The directory path where the pretrained vectorizer is saved.

Returns:

An instance of the pretrained vectorizer.

Return type:

BaseVectorizer

Raises:

NotImplementedError – This method must be implemented in derived classes.

save_pretrained(path: str | Path) None

Saves the pretrained vectorizer and tokenizer to the specified path.

Parameters:

path (str or Path) – The directory path where the pretrained vectorizer and tokenizer will be saved.

Returns:

None

transform(texts: str | List[str], *args, **kwargs) array

Transforms the input texts into numerical vectors.

Parameters:

texts (TokenizerInputs) – The input texts to transform.

Returns:

An array of numerical vectors representing the input texts.

Return type:

np.array

sumire.vectorizer.base.transformer_vectorizer_base module

class sumire.vectorizer.base.transformer_vectorizer_base.TransformersVectorizerBase

Bases: BaseVectorizer, GetTokenVectorsMixIn, ABC

fit(texts: str | List[str], *args, **kwargs) None

This method is not implemented for TransformersVectorizerBase.

Parameters:

texts (Union[str, List[str]]) – The input texts for fitting the vectorizer.

Returns:

None

classmethod load_init_args(path: str)

Loads initialization arguments for the vectorizer from the specified path.

Parameters:

path (str) – The directory path where the initialization arguments are stored.

Returns:

A dictionary containing the initialization arguments.

Return type:

dict

Raises:

ValueError – If the vectorizer configuration file does not exist at the specified path.
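
The documented behaviour of `load_init_args` (return a dictionary, raise `ValueError` when the configuration file is missing) can be sketched as follows. `toy_load_init_args` is a hypothetical function, the JSON format is assumed from the `vectorizer_config.json` file name, and the `pooling_method` key is invented for illustration.

```python
import json
import tempfile
from pathlib import Path

CONFIG_FILE = "vectorizer_config.json"  # matches the documented file name


def toy_load_init_args(path: str) -> dict:
    config_path = Path(path) / CONFIG_FILE
    if not config_path.exists():
        # Mirrors the documented ValueError for a missing configuration file.
        raise ValueError(f"{config_path} does not exist")
    return json.loads(config_path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    # First call: the directory is empty, so a ValueError is expected.
    try:
        toy_load_init_args(tmp)
        missing_raised = False
    except ValueError:
        missing_raised = True

    # Hypothetical key: the real configuration keys are not documented here.
    Path(tmp, CONFIG_FILE).write_text(
        json.dumps({"pooling_method": "cls"}), encoding="utf-8"
    )
    init_args = toy_load_init_args(tmp)

print(missing_raised, init_args)
```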

save_pretrained(path: str | Path) None

Saves the pretrained vectorizer, tokenizer, and model to the specified path.

Parameters:

path (str or Path) – The directory path where the pretrained vectorizer, tokenizer, and model will be saved.

Returns:

None

Module contents