Transformers Meet Classical XGB
A simple sentiment analysis experiment

What you can expect in this article
An overview of the problem of sentiment analysis (classifying text data)
What embedding models are, and a bird's-eye view of how they work
Python code implementing a basic "XGBClassifier + embeddings" classification system for sentiment classification on an Amazon reviews dataset
An illustration of the code, with observations about the output
The Problem
Given a set of free-text items (sentences, usually not very long) L = {S1, S2, …, Sn}, where each sentence has a positive or negative label, we want to build a classification system that classifies each sentence as correctly as possible. The main metric for this system is the Area Under the ROC Curve (AUC, Ref 1).
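As a quick illustration of the metric, here is a minimal sketch of how AUC can be computed with scikit-learn's roc_auc_score; the labels and scores below are made up for the example:
_______
from sklearn.metrics import roc_auc_score

# toy example: true labels (1 = positive review, 0 = negative review)
y_true = [1, 0, 1, 1, 0, 0]
# predicted probabilities of the positive class from some classifier
y_score = [0.9, 0.2, 0.45, 0.8, 0.4, 0.55]

# AUC = probability that a random positive is ranked above a random negative
print(roc_auc_score(y_true, y_score))  # ~0.89
_______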
The classical real-world application of this problem is sentiment analysis:
Given a set of "feedbacks", where each feedback consists of a few sentences, the system helps identify whether that feedback is positive or negative. This can help, for example, to predict the likelihood of purchase for the product, and hence control how it is ranked.
On a more advanced level, analyzing feedback text can give insights about (1) the factors on which people base how good or bad they judge the product to be, e.g. price, quality, speed of shipment, and (2) the values of these factors for each product. However, this is out of scope for this article.
The Solution
I will divide this section into some background, followed by the realization of the solution.
» Background
As is well known, the Transformer (a neural network architecture built for processing text) is the backbone of most LLMs. Pretrained transformers are a good tool to "embed" feedback text into fixed-length feature vectors.
So how do pretrained transformers convert a set of n feedback texts, where feedback i has a variable number of tokens m_i (1 <= i <= n), into an embeddings matrix E of fixed dimensions n x d?
Here is an outline of how it works; to delve further into the details, see Refs 2 and 3.
Each text T_i, 1 <= i <= n, is tokenized into a set of m_i tokens: "I love this product" → "I", "love", "this", "product".
The tokens are cleaned and normalized.
Each pretrained model has a core token-embedding matrix, call it M.
M has dimensions k x d, where k is the number of most common tokens the embedding model was trained on (for example 30,000 or 50,000 tokens). The embedding dimension d is an arbitrary number that forms the hyperspace of the embeddings; the higher the dimension, the more complex the assumption about the context surrounding each word (see Ref 2 for more details).
Each token u_ij (the jth token in the ith piece of text, 1 <= j <= m_i for each i) is mapped, via a predesigned lookup function, to a row number s in the token-embedding matrix M, u_ij → s. The embedding for this token is then M[s,:], a d-dimensional vector.
For all m_i tokens in text T_i we now have embedding vectors e_1, e_2, …, e_{m_i}, each of dimension d. If we stack all of these embedding vectors row-wise, we end up with a matrix of dimension m_i x d.
But we want to map each text T_i to a vector of dimension d, not a matrix of dimension m_i x d. The solution is to apply a pooling method that squashes the matrix column-wise from m_i x d down to a single d-dimensional vector. The most straightforward approach is average pooling (Refs 3 and 4).
Now we have mapped the set of n pieces of text (T_i, i = 1, …, n) to an embedding matrix E of dimension n x d (as in Figure 2). A minimal sketch of this lookup-and-pool pipeline follows.
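To make the lookup-and-pool step concrete, here is a minimal NumPy sketch. The vocabulary, the tiny token-embedding matrix M, and the whitespace tokenizer are all made up for illustration; a real pretrained model uses a learned matrix and a proper tokenizer:
_______
import numpy as np

# toy token-embedding matrix M: k = 6 known tokens, embedding dimension d = 4
rng = np.random.default_rng(0)
vocab = {"i": 0, "love": 1, "this": 2, "product": 3, "bad": 4, "quality": 5}
M = rng.normal(size=(len(vocab), 4))  # k x d


def embed_text(text: str) -> np.ndarray:
    # 1) tokenize + normalize (here: lowercase and split on whitespace)
    tokens = text.lower().split()
    # 2) map each token to its row in M (unknown tokens are skipped in this toy example)
    rows = [vocab[t] for t in tokens if t in vocab]
    token_embeddings = M[rows, :]          # m_i x d
    # 3) average pooling: squash m_i x d down to a single d-dimensional vector
    return token_embeddings.mean(axis=0)   # d


texts = ["I love this product", "Bad quality product"]
E = np.vstack([embed_text(t) for t in texts])  # n x d embedding matrix
print(E.shape)  # (2, 4)
_______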
However, one question arises: how is this embedding matrix trained? In summary, the embedding matrix is the weight matrix learned so that each word can predict its surrounding text in the embedding hyperspace of dimension d. This training depends on:
the text corpus used to train the embedding matrix,
the definition of "context" used in training (for example, the context might be the words surrounding each token),
the kind of loss function used.

Figure 3 shows a naive example of how we can define a target for training the embedding model to obtain the matrix. Given millions of pieces of real-world text, we try to set the "encoding" of each token so that the word can predict the context around it, i.e. nearby words; a small sketch of this idea follows below.
For more details on how embedding matrices are generated, check the wonderful tutorial in Ref 2.
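To make the training idea concrete, here is a minimal, purely illustrative PyTorch sketch of a tiny skip-gram-style setup: the embedding matrix is the weight learned so that a center token predicts a nearby token. The toy corpus, window size, and hyperparameters are assumptions made for this example and are not how the pretrained models used later were actually trained:
_______
import torch
import torch.nn as nn

# toy corpus and vocabulary (purely illustrative)
corpus = "i love this product it has great quality i love great products".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
k, d = len(vocab), 8  # vocabulary size, embedding dimension

# (center, context) pairs within a window of one word on each side
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]
centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])

# the token-embedding matrix M (k x d) is the weight we are trying to learn
embedding = nn.Embedding(k, d)
output_layer = nn.Linear(d, k)  # scores every vocabulary word as a candidate context word
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    list(embedding.parameters()) + list(output_layer.parameters()), lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    logits = output_layer(embedding(centers))  # predict the context word from the center word
    loss = loss_fn(logits, contexts)
    loss.backward()
    optimizer.step()

M = embedding.weight.detach()  # learned token-embedding matrix, k x d
print(M.shape)
_______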
Given the conversion from free text to a numerical matrix, we can directly apply a classical ML classifier for sentiment analysis. In this article I use XGBClassifier. Why? (1) It is well studied and commonly used among ML practitioners. (2) It is well supported in Python. (3) It is easy to understand and to extract metadata about the model, like feature importance (see Ref 5). A small sketch of this last point follows.
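As an illustration of point (3), here is a minimal sketch, on random made-up data, of fitting an XGBClassifier and reading its per-feature importances; the shapes and hyperparameters are arbitrary choices for the example, not the settings used later in the article:
_______
import numpy as np
from xgboost import XGBClassifier

# made-up data: 1,000 samples with 16 features and a binary label
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 16))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

clf = XGBClassifier(n_estimators=50, max_depth=3, objective="binary:logistic")
clf.fit(X, y)

# one importance score per input feature (for embeddings: per embedding dimension)
for i in np.argsort(clf.feature_importances_)[::-1][:5]:
    print(f"feature {i}: importance = {clf.feature_importances_[i]:.3f}")
_______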

» Implementation
In this part I provide simple working code that builds a sentiment analysis system over Amazon product review data.
Data: Amazon product review sentiment data (~3M samples for training, 400K for testing). See Ref 6.
Each sample has a "cleaned" review text and one of two labels, Label_1 or Label_2. Label_1 marks a bad review (rating 1 or 2) and Label_2 a good review (rating 4 or 5). Reviews with a rating of 3 are neutral and were removed from the data.
With that said, here is the working code:
_______
import bz2
from argparse import ArgumentParser
from datetime import datetime
from typing import Union

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch.cuda
from loguru import logger
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import roc_auc_score
from tqdm import tqdm
from xgboost import XGBClassifier


def read_amazon_data(file_path: str, n_lines: Union[int, None] = None):
    # Each line starts with a label (__label__1 / __label__2) followed by the review text
    logger.info("Reading lines")
    with bz2.open(file_path, mode="rt", encoding="utf-8", errors="ignore") as f:
        if n_lines is None:
            lines = f.readlines()
        else:
            lines = []
            for i, line in tqdm(enumerate(f), desc="read-lines"):
                lines.append(line)
                if i == n_lines - 1:
                    break
    lines_dicts = []
    sep = " "
    for line in lines:
        tokens = line.split(sep)
        label = tokens[0]
        sentence = sep.join(tokens[1:])
        lines_dicts.append({"label": label, "sentence": sentence})
    df = pd.DataFrame.from_records(data=lines_dicts)
    # __label__2 = good review (rating 4 or 5), __label__1 = bad review (rating 1 or 2)
    df["label"] = df["label"].apply(lambda x: 1 if x == "__label__2" else 0)
    return df


def get_xgb_emb_auc(bst: XGBClassifier, X_train: np.ndarray, X_test: np.ndarray, y_train: np.ndarray,
                    y_test: np.ndarray):
    # fit model
    bst.fit(X_train, y_train)
    # make predictions and score them with ROC-AUC
    y_pred_proba = bst.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred_proba)
    return auc


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--train_data_path", type=str, required=True)
    parser.add_argument("--test_data_path", type=str, required=True)
    parser.add_argument("--embedding-model-name", type=str, required=True)
    parser.add_argument("--n-train", type=int, required=True)
    parser.add_argument("--n-test", type=int, required=True)
    parser.add_argument("--dim-fraction-step", type=float, required=True)
    args = parser.parse_args()

    logger.info(f"Is cuda available {torch.cuda.is_available()}")
    df_train = read_amazon_data(file_path=args.train_data_path, n_lines=args.n_train)
    df_test = read_amazon_data(file_path=args.test_data_path, n_lines=args.n_test)
    y_train = df_train["label"]
    y_test = df_test["label"]
    logger.info(f"Embedding Model = {args.embedding_model_name}")
    train_sentences = df_train["sentence"].to_list()
    test_sentences = df_test["sentence"].to_list()
    # embed sentences on the GPU with the chosen pretrained model
    embeddings_model = SentenceTransformer(model_name_or_path=args.embedding_model_name, device="cuda")
    bst = XGBClassifier(n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')
    logger.info(f"Classifier Model = {bst}")
    # get embeddings model metadata
    embeddings_dim = embeddings_model.get_sentence_embedding_dimension()
    logger.info(f"Embeddings dim = {embeddings_dim}")
    logger.info(
        f"Applying embeddings to train sentences (n={args.n_train}) and test sentences (n={args.n_test})")
    start_timestamp = datetime.now()
    train_embeddings = embeddings_model.encode(sentences=train_sentences, show_progress_bar=True)
    end_timestamp = datetime.now()
    train_embeddings_generation_time = (end_timestamp - start_timestamp).seconds
    start_timestamp = datetime.now()
    test_embeddings = embeddings_model.encode(sentences=test_sentences, show_progress_bar=True)
    end_timestamp = datetime.now()
    test_embeddings_generation_time = (end_timestamp - start_timestamp).seconds
    # get full-rank (dim) AUC
    auc_data = []
    auc = get_xgb_emb_auc(bst=bst, X_train=train_embeddings, X_test=test_embeddings, y_train=y_train, y_test=y_test)
    logger.info(f"AUC for full-dim features (dim = {embeddings_dim}), = {auc}")
    auc_data.append((embeddings_dim, auc))
    # progressively reduce the embedding dimension with truncated SVD and re-evaluate
    dim_step = int(args.dim_fraction_step * embeddings_dim)
    target_dim = embeddings_dim - dim_step
    while target_dim > 0:
        assert target_dim < args.n_train
        assert target_dim < args.n_test
        logger.info(f"Lowering dimension from {embeddings_dim} to {target_dim}")
        svd_model = TruncatedSVD(n_components=target_dim, n_iter=7, random_state=42)
        train_embeddings_low_dim = svd_model.fit_transform(X=train_embeddings)
        # project the test embeddings with the SVD fitted on the train embeddings
        test_embeddings_low_dim = svd_model.transform(X=test_embeddings)
        auc = get_xgb_emb_auc(bst=bst, X_train=train_embeddings_low_dim,
                              X_test=test_embeddings_low_dim, y_train=y_train, y_test=y_test)
        logger.info(f"After lowering dimension from {embeddings_dim} to {target_dim}, auc = {auc}")
        auc_data.append((target_dim, auc))
        target_dim -= dim_step
    x, y = zip(*auc_data)
    # plot AUC vs number of embedding dimensions
    plt.plot(x, y, linestyle="-")
    plt.xlabel("n_features")
    plt.ylabel("classifier auc")
    plt.title(f"Embedding model={args.embedding_model_name},\n"
              f"Classifier=XGB,n_train={args.n_train},n_test={args.n_test}")
    plt.grid(True)
    # save before show(); show() may close the figure in interactive backends
    plt.savefig(f"xgb_embeddings_{args.embedding_model_name}.png")
    plt.show()
_______
In summary, the code does the following:
It loads the Amazon train and test data.
It uses the Sentence Transformers library (Ref 7) to load a variety of pretrained embedding models; the model name is passed as a parameter.
It creates a simple XGBClassifier.
The pretrained model is applied to a "sample" of the training and test text, using the GPU.
Having the feature matrices X_train and X_test and the targets y_train and y_test, we train the classifier, test it against the test data, and calculate the AUC.
Extra: due to the high dimensionality, I tested compressing the high-dimensional X_train and X_test with truncated Singular Value Decomposition (SVD), to see whether lower-dimensional embeddings can give the same AUC; a small standalone sketch of this step follows the list.
SVD: https://en.wikipedia.org/wiki/Singular_value_decomposition
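As a standalone illustration of the compression step, here is a minimal sketch, on random made-up matrices, of fitting TruncatedSVD on the training embeddings and reusing that same fit to project the test embeddings; the shapes and the target dimension are arbitrary choices for the example:
_______
import numpy as np
from sklearn.decomposition import TruncatedSVD

# made-up embedding matrices: 1,000 train and 200 test samples of dimension 384
rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(1000, 384))
test_embeddings = rng.normal(size=(200, 384))

# compress 384 dimensions down to 80
svd = TruncatedSVD(n_components=80, n_iter=7, random_state=42)
train_low = svd.fit_transform(train_embeddings)  # fit the components on train only
test_low = svd.transform(test_embeddings)        # project test with the same components

print(train_low.shape, test_low.shape)           # (1000, 80) (200, 80)
print(svd.explained_variance_ratio_.sum())       # fraction of variance kept
_______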
Here is a sample run:
python xgb_with_embeddings.py --train_data_path train.ft.txt.bz2 --test_data_path test.ft.txt.bz2 --embedding-model-name <MODEL_NAME> --n-train 100_000 --n-test 10_000 --dim-fraction-step 0.2
I have tried two models:
all-MiniLM-L6-v2
all-MiniLM-L12-v2
Why these two models? Both are relatively small models with which I managed to get results on my laptop.
Here is a ChatGPT-generated table comparing the two models (see also Ref 8):

Now let’s present the code output.
all-MiniLM-L6-v2
2025-08-29 01:43:39.705 | INFO | __main__:<module>:78 - Embedding Model = all-MiniLM-L6-v2
2025-08-29 01:43:44.327 | INFO | __main__:<module>:84 - Classifier Model = XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
feature_weights=None, gamma=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=1, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=2,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=2,
n_jobs=None, num_parallel_tree=None, ...)
2025-08-29 01:43:44.327 | INFO | __main__:<module>:87 - Embeddings dim = 384
2025-08-29 01:43:44.327 | INFO | __main__:<module>:88 - Applying embeddings to train sentences (n=100000) and test sentences (n=10000) sentences
Batches: 100%|██████████| 3125/3125 [05:19<00:00, 9.78it/s]
Batches: 100%|██████████| 313/313 [00:40<00:00, 7.70it/s]
2025-08-29 01:49:49.794 | INFO | __main__:<module>:103 - AUC for full-dim features (dim = 384), = 0.712645583489681
2025-08-29 01:49:49.794 | INFO | __main__:<module>:111 - Lowering dimension from 384 to 308
2025-08-29 01:50:05.474 | INFO | __main__:<module>:117 - After lowering dimension from 384 to 308, auc = 0.29819621263289553
2025-08-29 01:50:05.474 | INFO | __main__:<module>:111 - Lowering dimension from 384 to 232
2025-08-29 01:50:23.519 | INFO | __main__:<module>:117 - After lowering dimension from 384 to 232, auc = 0.29819621263289553
2025-08-29 01:50:23.519 | INFO | __main__:<module>:111 - Lowering dimension from 384 to 156
2025-08-29 01:50:35.937 | INFO | __main__:<module>:117 - After lowering dimension from 384 to 156, auc = 0.29819621263289553
2025-08-29 01:50:35.937 | INFO | __main__:<module>:111 - Lowering dimension from 384 to 80
2025-08-29 01:50:44.432 | INFO | __main__:<module>:117 - After lowering dimension from 384 to 80, auc = 0.29819621263289553
2025-08-29 01:50:44.433 | INFO | __main__:<module>:111 - Lowering dimension from 384 to 4
2025-08-29 01:50:46.379 | INFO | __main__:<module>:117 - After lowering dimension from 384 to 4, auc = 0.6147143164477799
Process finished with exit code 0
all-MiniLM-L12-v2
2025-08-29 02:32:41.058 | INFO | __main__:<module>:78 - Embedding Model = all-MiniLM-L12-v2
2025-08-29 02:32:45.620 | INFO | __main__:<module>:84 - Classifier Model = XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
feature_weights=None, gamma=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=1, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=2,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=2,
n_jobs=None, num_parallel_tree=None, ...)
2025-08-29 02:32:45.620 | INFO | __main__:<module>:87 - Embeddings dim = 384
2025-08-29 02:32:45.620 | INFO | __main__:<module>:88 - Applying embeddings to train sentences (n=100000) and test sentences (n=10000) sentences
Batches: 100%|██████████| 3125/3125 [07:55<00:00, 6.57it/s]
Batches: 100%|██████████| 313/313 [00:58<00:00, 5.38it/s]
2025-08-29 02:41:44.385 | INFO | __main__:<module>:103 - AUC for full-dim features (dim = 384), = 0.7219352095059413
2025-08-29 02:41:44.385 | INFO | __main__:<module>:111 - Lowering dimension from 384 to 308
2025-08-29 02:41:57.273 | INFO | __main__:<module>:117 - After lowering dimension from 384 to 308, auc = 0.7188832420262664
2025-08-29 02:41:57.273 | INFO | __main__:<module>:111 - Lowering dimension from 384 to 232
2025-08-29 02:42:09.236 | INFO | __main__:<module>:117 - After lowering dimension from 384 to 232, auc = 0.7188463989993746
2025-08-29 02:42:09.236 | INFO | __main__:<module>:111 - Lowering dimension from 384 to 156
2025-08-29 02:42:21.646 | INFO | __main__:<module>:117 - After lowering dimension from 384 to 156, auc = 0.7188463989993746
2025-08-29 02:42:21.646 | INFO | __main__:<module>:111 - Lowering dimension from 384 to 80
2025-08-29 02:42:28.789 | INFO | __main__:<module>:117 - After lowering dimension from 384 to 80, auc = 0.7188463989993746
2025-08-29 02:42:28.789 | INFO | __main__:<module>:111 - Lowering dimension from 384 to 4
2025-08-29 02:42:30.505 | INFO | __main__:<module>:117 - After lowering dimension from 384 to 4, auc = 0.6028136985616009
Process finished with exit code 0
Observations
The AUC for the two models is almost the same (~0.72) for full-dimension embeddings (dim = 384).
The L12 model is more robust to dimension compression than the L6 model (this needs more investigation).
The AUC value means that the classifier provides "acceptable discrimination" (see Fig. 6). Keep in mind that this is a naive PoC to test the embeddings + XGB classifier combination.

Given these results, we can conclude that embeddings + XGBClassifier can give a decent classification system for text data.
The work done here is inspired by the tutorial in Ref 10.
References
1. AUC and ROC: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
2. Embeddings: A Matrix of Meaning: https://petuum.medium.com/embeddings-a-matrix-of-meaning-4de877c9aa27
3. Pooling in Embeddings: https://medium.com/@suvasism/pooling-in-embedding-16781aacfb12
4. Sentence Transformer Modules: https://www.sbert.net/docs/package_reference/sentence_transformer/models.html and https://www.sbert.net/docs/package_reference/sentence_transformer/models.html#sentence_transformers.models.Pooling
5. XGB Classifiers: https://www.geeksforgeeks.org/machine-learning/xgbclassifier/
6. Amazon Review Dataset: https://www.kaggle.com/datasets/bittlingmayer/amazonreviews
7. Sentence Transformers Library: https://sbert.net/
8. Pretrained models and embedding model specs: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html, https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2, https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2
9. Hosmer, Lemeshow & Sturdivant (2013): Applied Logistic Regression (3rd ed.). Wiley. https://dl.icdst.org/pdfs/files4/7751d268eb7358d3ca5bd88968d9227a.pdf
10. Combining XGBoost and Embeddings: Hybrid Semantic Boosted Trees? https://machinelearningmastery.com/combining-xgboost-and-embeddings-hybrid-semantic-boosted-trees/



