The Bank of England published Staff Working Paper No. 1,127 by Marcus Buckmann and Ed Hill, showing that smaller, locally hosted generative language models can be used effectively for text classification tasks such as sentiment analysis. The paper's main result is that penalised logistic regression trained on embeddings from a small model can often match or exceed the performance of a large commercial model, while offering practical advantages in privacy, availability, cost and explainability. Across 17 sentence classification tasks with 2 to 4 classes, the authors compare zero-shot next-token prediction with several calibration approaches, including penalised logistic regression on embeddings from a quantised Llama 2 7B model. They find that the embedding-based approach frequently reaches GPT-4-level performance with only dozens of labelled examples per class (many datasets require around 60–75 training samples per class), and that wrapping the sentence in clear task instructions improves results. The paper also reports stable, interpretable word-level explanations for classification decisions and tests robustness across prompt variants, model sizes and quantisation levels. As a staff working paper, it is published as research in progress to elicit comments and debate and does not represent Bank of England policy.
Bank of England 2025-05-23
Bank of England working paper finds small local LLM embeddings with penalised logistic regression can match or outperform GPT-4 on text classification
The Bank of England's Staff Working Paper No. 1,127 shows that smaller, locally hosted generative language models can effectively perform text classification tasks like sentiment analysis, offering privacy, cost and explainability benefits. Penalised logistic regression on embeddings from a small model can match or exceed the performance of large commercial models, often achieving GPT-4-level results with only a few dozen labelled examples per class. The research is published to elicit comments and debate and does not reflect official Bank of England policy.
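The embeddings-plus-penalised-logistic-regression recipe described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: random Gaussian clusters stand in for sentence embeddings (which in the paper come from a quantised Llama 2 7B model), the sample size mimics the "dozens of labelled examples per class" regime, and the L2-penalised logistic loss is minimised with plain gradient descent.

```python
import numpy as np

# Toy stand-in for sentence embeddings from a small local LLM.
# In the paper these would come from a quantised Llama 2 7B model;
# here two Gaussian clusters play that role (illustrative assumption).
rng = np.random.default_rng(0)
dim = 32                 # embedding dimension (toy value)
n_per_class = 60         # roughly the 60-75 samples-per-class regime

pos = rng.normal(loc=0.5, scale=1.0, size=(n_per_class, dim))
neg = rng.normal(loc=-0.5, scale=1.0, size=(n_per_class, dim))
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])

def fit_penalised_logreg(X, y, lam=1.0, lr=0.1, steps=500):
    """Gradient descent on the L2-penalised logistic loss.

    lam is the ridge penalty; larger values shrink the weights,
    which helps when labelled examples are scarce.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(y=1)
        grad_w = X.T @ (p - y) / n + lam * w / n  # loss + L2 gradient
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit_penalised_logreg(X, y)
preds = (X @ w + b > 0).astype(float)
accuracy = np.mean(preds == y)
```

Because the classifier is a linear model over embedding dimensions, its weights can be traced back to input features, which is one route to the kind of word-level explanations the paper reports.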