SetFit Unpacked: When Sentence Transformers Go to the ‘Classification Gym’

#ModelEnhancement
#AdvancedML
#NLPTools





Introduction:

Sentence transformers are a significant advancement in natural language processing, enabling the conversion of textual data into meaningful vector representations or ‘embeddings’. These embeddings effectively encapsulate the contextual and semantic nuances of sentences, making them invaluable for various machine learning applications.

Sentence transformers are typically built on top of strong pre-trained encoders such as BERT (Bidirectional Encoder Representations from Transformers), known for its deep understanding of context and language structure. Other notable backbone models include RoBERTa and DistilBERT, each contributing uniquely to the landscape of natural language understanding.

These models are particularly adept at generating embeddings that can be utilized as foundational inputs for various machine learning tasks. By leveraging the pre-trained knowledge of these transformers, it’s possible to enhance the performance of other machine learning models in tasks such as sentiment analysis, text classification, and more. This process of transfer learning is crucial, as it allows for the application of advanced linguistic understanding to a broad range of computational challenges.

In this technical article, we explore Few-Shot Learning and how its fusion with Contrastive Learning enhances frameworks like SetFit. We also touch upon the crucial aspects of fine-tuning and hyperparameter tuning for optimal model performance :)

Fine-Tuning Challenge:

Fine-tuning sentence transformers, which are powerful tools for understanding language, often requires a lot of data. This can be a big problem, especially when there’s not enough relevant data available. Imagine trying to teach someone about a topic using only a few examples — they might not learn it very well. It’s similar with these AI models; they need lots of examples to learn how to do a specific job well. But finding enough good quality data can be tough, particularly for less common topics or for smaller organizations that don’t have many resources. This makes it hard to customize these AI models for specific needs if you don’t have enough data.

The synergy between few-shot learning and contrastive learning:

Few-shot learning and contrastive learning are two innovative approaches in the field of machine learning, each addressing unique challenges in training AI models.

Few-shot learning is focused on training models with a very limited amount of data. Traditional machine learning algorithms typically require large datasets to learn effectively. However, in many real-world scenarios, such a wealth of data is not available. Few-shot learning algorithms are designed to overcome this limitation. They do so by leveraging prior knowledge — either from similar tasks or from a subset of the data — to make accurate predictions with only a few examples. This approach is particularly useful in specialized fields where data is scarce.

An example of basic few-shot learning
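
To make the few-shot setting concrete, here is a minimal sketch (the labeled DataFrame and its column names are made up for illustration) that keeps only a couple of labeled examples per class for training:

import pandas as pd

# Made-up labeled data: in a real few-shot scenario this may be all you have
data = pd.DataFrame({
    "text": ["refund not received", "love the new update", "app keeps crashing",
             "great support team", "charged twice this month", "checkout works smoothly"],
    "label": ["complaint", "praise", "complaint", "praise", "complaint", "praise"],
})

# Few-shot setting: keep at most 2 labeled examples per class for training
few_shot_train = (
    data.groupby("label", group_keys=False)
        .apply(lambda g: g.sample(n=min(len(g), 2), random_state=42))
)
print(few_shot_train)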


Contrastive learning, on the other hand, is a technique used primarily in unsupervised learning, particularly for learning efficient representations of data. It works by teaching a model to understand the differences and similarities between pairs of examples. In simple terms, the model learns by comparing things: it is trained to pull similar items closer in the representation space and push dissimilar items apart. This approach is highly effective in tasks like image and speech recognition, where understanding the nuanced differences between inputs is crucial.



This figure illustrates the impact of fine-tuning a sentence transformer on its embedding space using contrastive learning.
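
To give a feel for what such contrastive fine-tuning looks like in code, here is a small sketch using the sentence-transformers library directly; the model name and the hand-written pairs are illustrative, not taken from any particular dataset:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load a pre-trained sentence transformer
model = SentenceTransformer("paraphrase-mpnet-base-v2")

# Hand-made pairs: label 1.0 marks similar meaning, 0.0 marks different meaning
train_examples = [
    InputExample(texts=["I want my money back", "Please issue a refund"], label=1.0),
    InputExample(texts=["I want my money back", "The new interface looks great"], label=0.0),
    InputExample(texts=["The app crashes on launch", "It closes as soon as I open it"], label=1.0),
    InputExample(texts=["The app crashes on launch", "Customer support was very helpful"], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# Pull similar pairs together and push dissimilar pairs apart in the embedding space
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)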


When combined, few-shot and contrastive learning can be particularly powerful. Contrastive learning can be used to create rich, detailed representations of data, which few-shot learning algorithms can then utilize to make accurate predictions with minimal examples. This synergy allows for the development of robust models capable of learning effectively from limited data, a major advantage in fields where acquiring large datasets is challenging or impossible.

SetFit: Efficient Fine-Tuning of Sentence Transformers:

SetFit is a tool that helps in fine-tuning sentence transformers, making it easier to build classification systems. Here’s how it works:

  • Fine-Tuning Made Easy: SetFit allows you to adjust pre-trained sentence transformers to better suit your specific needs, even if you don’t have a lot of data. This is great for creating models that understand and categorize text in the way you want.
  • Better Classification: With SetFit, you can train your sentence transformer to be more accurate in classifying texts into different categories. Whether you’re sorting customer feedback, identifying topics in documents, or any other classification task, SetFit improves the accuracy of these processes.
  • Saves Time and Resources: Since SetFit works well with small datasets and doesn’t require heavy computing power, you can set up your classification system faster and without needing a lot of resources.

SetFit is a useful tool for adapting sentence transformers to your specific needs, making it easier and more efficient to build effective text classification systems.

Benchmarking SetFit

SetFit, though utilizing smaller models compared to other few-shot methods, achieves comparable or superior performance across various benchmarks. For instance, in the RAFT few-shot classification benchmark, the SetFit approach with Roberta (specifically the all-roberta-large-v1 model), which has 355 million parameters, surpasses the results of both PET and GPT-3. It ranks slightly below the average human performance and the T-few model, which is 30 times larger with 11 billion parameters. Impressively, SetFit exceeds human baseline performance in 7 out of the 11 RAFT tasks, showcasing its effectiveness despite its relatively smaller model size.


Credits to: https://huggingface.co/blog/setfit


Fast training and inference

Credits to: https://huggingface.co/blog/setfit


SetFit stands out for its high accuracy with smaller models, leading to rapid training speeds and significantly lower costs. For example, training SetFit on an NVIDIA V100 GPU with only 8 labeled examples takes a mere 30 seconds, costing about $0.025. In contrast, training the larger T-Few 3B model on an NVIDIA A100 takes about 11 minutes and costs roughly $0.7 for the same task, which is 28 times more expensive. Notably, SetFit is versatile enough to run on a single GPU, like those available on Google Colab, and it can even be trained on a CPU in just a few minutes. Despite its speed, SetFit maintains comparable performance to larger models. Additionally, when it comes to inference and model distillation, SetFit can achieve speed-ups of up to 123 times, demonstrating its impressive efficiency and cost-effectiveness.

How SetFit Works:

In this architecture, SetFit operates in a streamlined two-stage process aimed at enhancing sentence transformers for classification tasks.

In the first stage, “ST Fine tuning,” the process begins with few-shot training data. From this data, SetFit generates sentence pairs, which are instrumental in understanding the relationships and context within the text. These pairs are then used to fine-tune a pre-trained sentence transformer (ST), allowing the model to adjust to the specific nuances of the training data.

The second stage, “Classification head training,” takes the fine-tuned sentence transformer and applies it to encode sentences, effectively transforming them into sentence embeddings. These embeddings capture the essence of the text data in a form that’s suitable for classification. Finally, a classification head is trained using these embeddings. This classification head is a specialized component that learns to categorize the embedded sentences into predefined classes, thereby completing the process of preparing the model to accurately perform classification tasks.

Through this two-stage approach, SetFit efficiently adapts sentence transformers to specific classification challenges, utilizing minimal training data to achieve high accuracy.
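
Stripped of the SetFit wrapper, the second stage boils down to something like the sketch below (the model name, texts, and labels are placeholders): encode sentences with the fine-tuned body, then fit a lightweight classifier on those embeddings.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Stand-in for the sentence transformer produced by stage one (contrastive fine-tuning)
body = SentenceTransformer("paraphrase-mpnet-base-v2")

texts = ["refund not received", "love the new design", "payment failed again"]
labels = [0, 1, 0]  # e.g. 0 = complaint, 1 = praise

# Stage two: encode the sentences, then train the classification head on the embeddings
embeddings = body.encode(texts)
head = LogisticRegression().fit(embeddings, labels)

print(head.predict(body.encode(["the checkout page keeps failing"])))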

Let’s talk about generating sentence pairs:

In SetFit, generating sentence pairs is a critical step for fine-tuning sentence transformers for specific tasks. The process involves creating pairs of sentences from the training data that are either similar or dissimilar. This is done by pairing a sentence with another sentence that shares a similar meaning (a positive pair) or with one that has a different meaning (a negative pair). By doing this, SetFit teaches the sentence transformer to understand the nuances of similarity and difference within the context of the given task. These sentence pairs are then used to fine-tune the transformer so that it can produce more accurate embeddings for the classification task at hand, effectively adapting the model with just a small amount of data.
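
A plain-Python sketch of the idea (not SetFit's internal implementation) helps to picture this: sentences that share a label become positive pairs, and sentences with different labels become negative pairs.

from itertools import combinations

# A handful of labeled sentences, as in a few-shot training set
labeled = [
    ("I want a refund", "complaint"),
    ("My order never arrived", "complaint"),
    ("Great service, thank you", "praise"),
    ("The staff were lovely", "praise"),
]

positive_pairs, negative_pairs = [], []
for (text_a, label_a), (text_b, label_b) in combinations(labeled, 2):
    if label_a == label_b:
        positive_pairs.append((text_a, text_b, 1.0))  # similar: pull together
    else:
        negative_pairs.append((text_a, text_b, 0.0))  # dissimilar: push apart

print(len(positive_pairs), "positive pairs,", len(negative_pairs), "negative pairs")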


Traditionally, text classification with embeddings involves two disconnected steps: a sentence transformer (ST) generates generic embeddings from text, and a separate classifier is then trained on top of them. SetFit tightens this pipeline by fine-tuning the ST itself with contrastive learning on a handful of labeled pairs, so the embeddings are already adapted to the task before the lightweight classification head is trained; optionally, a differentiable classification layer can even be trained end-to-end with the ST. This reduces overall complexity and computational overhead, enables rapid fine-tuning and efficient inference on small datasets, and works with a diverse range of STs for various NLP tasks.
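
For the end-to-end option mentioned above, setfit exposes a differentiable (torch) classification head in place of the default scikit-learn one; a minimal sketch, assuming the use_differentiable_head argument of SetFitModel.from_pretrained and an illustrative three-class task:

from setfit import SetFitModel

# Differentiable head trained jointly with the sentence transformer;
# out_features should match the number of classes (3 here is illustrative)
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    use_differentiable_head=True,
    head_params={"out_features": 3},
)

From there, the same Trainer workflow shown below applies.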

Let’s dive into building our SetFit model:

This code snippet demonstrates how to fine-tune a SetFit model for text classification using Python libraries such as setfit, optuna, and sentence_transformers. Here’s a breakdown of the code:

Training:

# Install necessary packages
!pip install setfit
!pip install optuna

# Import required libraries
from sklearn.model_selection import train_test_split
from setfit import TrainingArguments, Trainer, sample_dataset, SetFitModel
from sentence_transformers import SentenceTransformer
from datasets import Dataset
from sklearn.preprocessing import LabelEncoder
import pandas as pd
from sklearn.linear_model import LogisticRegression
from optuna import Trial
from typing import Dict, Union, Any

# Load the dataset from a CSV file
data = pd.read_csv("data.csv")

# Encode categorical labels into numerical form
labelencoder = LabelEncoder()
encoded_Y = labelencoder.fit_transform(data['category'])
data['category_encoded'] = encoded_Y

# Split the data into training and testing sets
X = data[["X_col"]]
Y = data[["category_encoded"]]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
train = pd.concat([X_train, Y_train], axis=1)
test = pd.concat([X_test, Y_test], axis=1)

# Get unique labels
label = train.category_encoded.unique().tolist()

# Create Dataset objects from the training and testing data
train_dataset = Dataset.from_pandas(train)
test_dataset = Dataset.from_pandas(test)

# Create a balanced subset of the training data
train_dataset = sample_dataset(train_dataset, label_column="category_encoded", num_samples=8)

# Load a pre-trained Sentence Transformer model as the feature extractor
model_body = SentenceTransformer("paraphrase-mpnet-base-v2")

# Choose Logistic Regression as the classification head
model_head = LogisticRegression()

# Create a SetFit model, combining the feature extractor and classification head
model = SetFitModel(model_body, model_head)

# Define training arguments including batch size, number of epochs, and end-to-end training
args = TrainingArguments(
    batch_size=32,
    num_epochs=2,
    end_to_end=True
)

# Create a Trainer object for training the model
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    column_mapping={"X_col": "text", "category_encoded": "label"}
)

# Train the model
trainer.train()

# Evaluate the model on the test dataset
trainer.evaluate(test_dataset)

Installation:

The setfit and optuna packages are installed. SetFit provides the framework for few-shot sentence-transformer fine-tuning and classification, while Optuna powers the hyperparameter search applied later in the tuning step.

1. Data Preparation:

The code starts by loading a dataset from a CSV file, data.csv. It uses LabelEncoder from scikit-learn to encode categorical labels into numerical form, which is necessary for machine learning models to process. The train_test_split function splits the data into training and test sets, with 20% of the data reserved for testing. The training and test sets are then combined with their respective labels.

2. SetFit Model Setup:

The sample_dataset function is used to create a balanced subset of the training data, ensuring that there are an equal number of samples for each label. A pre-trained sentence transformer model, paraphrase-mpnet-base-v2, is loaded to serve as the base of the SetFit model. A LogisticRegression model is chosen as the classification head. The SetFitModel is then instantiated with the sentence transformer and logistic regression model.

3. Training:

TrainingArguments are defined to set the batch size, the number of epochs, and an end_to_end flag, which means the classifier-training phase also updates the embedding model rather than freezing it. A Trainer is created with the defined model, training arguments, training dataset, and a column mapping that specifies which columns in the dataset correspond to the text and labels. The train() method of the Trainer class is called to start the training process.

4. Evaluation:

Finally, the evaluate() method is used to assess the model's performance on the test dataset. Let's move on now to hyperparameter tuning.

# Define a function to initialize the SetFit model with hyperparameters
def model_init(params: Dict[str, Any]) -> SetFitModel:
    params = params or {}
    max_iter = params.get("max_iter", 100)
    solver = params.get("solver", "liblinear")
    params = {
        "head_params": {
            "max_iter": max_iter,
            "solver": solver,
        }
    }
    return SetFitModel.from_pretrained("BAAI/bge-small-en-v1.5", **params)

# Define the hyperparameter space for optimization
def hp_space(trial: Trial) -> Dict[str, Union[float, int, str]]:
    return {
        "body_learning_rate": trial.suggest_float("body_learning_rate", 1e-6, 1e-3, log=True),
        "num_epochs": trial.suggest_int("num_epochs", 1, 3),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32, 64]),
        "seed": trial.suggest_int("seed", 1, 40),
        "max_iter": trial.suggest_int("max_iter", 50, 300),
        "solver": trial.suggest_categorical("solver", ["newton-cg", "lbfgs", "liblinear"]),
        "end_to_end": True,
    }

# Create a Trainer for hyperparameter optimization
trainer = Trainer(
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    model_init=model_init,
    column_mapping={"descriptionTransaction": "text", "category_encoded": "label"}
)

# Perform hyperparameter search
best_run = trainer.hyperparameter_search(direction="maximize", hp_space=hp_space, n_trials=10)

# Apply the best hyperparameters to the final model
trainer.apply_hyperparameters(best_run.hyperparameters, final_model=True)

# Train the final model
trainer.train()

# Evaluate the final model and collect metrics
metrics = trainer.evaluate()

1. SetFit Model Function (model_init):

This function initializes the SetFit model with specific hyperparameters. It takes a dictionary of hyperparameters (params) as input, where max_iter and solver are used to configure the logistic regression classification head. The function returns a SetFit model initialized with the specified hyperparameters.

2. Hyperparameter Space Definition (hp_space):

This function defines the hyperparameter search space for optimization using Optuna. It specifies ranges or choices for various hyperparameters, such as body_learning_rate, num_epochs, batch_size, seed, max_iter, solver, and end_to_end. Optuna will explore different combinations of these hyperparameters during the optimization process.

3. Trainer Initialization (trainer):

The Trainer is created to manage the entire fine-tuning and hyperparameter optimization process. It is configured with the training and evaluation datasets (train_dataset and test_dataset), the model_init function for model initialization, and the column mapping to specify which columns in the dataset correspond to text and labels.

4. Hyperparameter Optimization (best_run):

The trainer.hyperparameter_search() method is used to perform the hyperparameter optimization. It takes several parameters:

  • direction="maximize": the optimization aims to maximize a chosen metric (e.g., accuracy).
  • hp_space=hp_space: the hyperparameter space defined earlier is used.
  • n_trials=10: the number of optimization trials to run (here, 10 trials).

After optimization, it returns the best set of hyperparameters in best_run.

5. Apply Best Hyperparameters and Train (trainer.apply_hyperparameters and trainer.train):

The best hyperparameters from the optimization are applied to the final model using trainer.apply_hyperparameters. The final model is then trained using trainer.train().

6. Evaluate the Final Model (metrics):

Finally, the code evaluates the performance of the fine-tuned model on the test dataset and collects metrics. These metrics could include accuracy, precision, recall, F1-score, etc., depending on the specific classification task.
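
If per-class numbers are wanted beyond what trainer.evaluate() returns, the tuned model's predictions can be fed to scikit-learn's reporting utilities; the sketch below reuses the test split and column names from the earlier snippets (which are themselves placeholders):

from sklearn.metrics import classification_report

# Predict labels for the raw test texts with the tuned SetFit model
preds = trainer.model.predict(test["X_col"].tolist())

# Convert to plain ints in case predictions come back as a tensor or array
preds = [int(p) for p in preds]

# Per-class precision, recall and F1-score for the encoded labels
print(classification_report(test["category_encoded"], preds))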

Conclusion:

SetFit represents a groundbreaking approach to text classification and fine-tuning. By seamlessly integrating sentence transformers and classification tasks into a unified process, it simplifies and accelerates model development. With efficient training and diverse pre-trained models, SetFit offers a promising avenue for creating high-performance text classifiers while reducing complexity and resource demands. This innovative framework is poised to advance natural language processing tasks, making them more accessible and effective for a wide range of applications.

SetFit paper: https://arxiv.org/abs/2209.11055

Hugging Face blog about SetFit: https://huggingface.co/blog/setfit

Thanks for reading! Feel free to reach out if you found this post interesting or if it helped you out in any way.