The Bank of England published Staff Working Paper No. 1,127 by Marcus Buckmann and Ed Hill, showing that smaller, locally hosted generative language models can be used effectively for text classification tasks such as sentiment analysis. The paper's main result is that penalised logistic regression trained on embeddings from a small model can often match or exceed the performance of a large commercial model, while offering practical advantages in privacy, availability, cost and explainability. Across 17 sentence classification tasks with 2 to 4 classes, the authors compare zero-shot next-token prediction with several calibration approaches, including penalised logistic regression on embeddings from a quantised Llama 2 7B model. They find that the embedding-based approach frequently reaches GPT-4-level performance with only dozens of labelled examples per class (many datasets require around 60–75 training samples per class), and that wrapping the sentence in clear task instructions improves results. The paper also reports stable, interpretable word-level explanations for classification decisions and tests robustness across prompt variants, model sizes and quantisation levels. As a staff working paper, it is published as research in progress to elicit comments and debate and does not represent Bank of England policy.
Bank of England 2025-05-23
Bank of England working paper finds small local LLM embeddings with penalised logistic regression can match or outperform GPT-4 on text classification
The Bank of England's Staff Working Paper No. 1,127 shows that smaller, locally hosted generative language models can effectively perform text classification tasks like sentiment analysis, offering privacy, cost and explainability benefits. Penalised logistic regression on embeddings from a small model can match or exceed the performance of large commercial models, often achieving GPT-4-level results with only a few dozen labelled examples per class. The research is published to elicit comments and debate and does not reflect official Bank of England policy.
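The embeddings-plus-penalised-logistic-regression recipe described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: random Gaussian clusters stand in for sentence embeddings (which in the paper come from a quantised Llama 2 7B model), the sample size mimics the "dozens of labelled examples per class" regime, and the L2-penalised logistic loss is minimised with plain gradient descent.

```python
import numpy as np

# Toy stand-in for sentence embeddings from a small local LLM.
# In the paper these would come from a quantised Llama 2 7B model;
# here two Gaussian clusters play that role (illustrative assumption).
rng = np.random.default_rng(0)
dim = 32                 # embedding dimension (toy value)
n_per_class = 60         # roughly the 60-75 samples-per-class regime

pos = rng.normal(loc=0.5, scale=1.0, size=(n_per_class, dim))
neg = rng.normal(loc=-0.5, scale=1.0, size=(n_per_class, dim))
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])

def fit_penalised_logreg(X, y, lam=1.0, lr=0.1, steps=500):
    """Gradient descent on the L2-penalised logistic loss.

    lam is the ridge penalty; larger values shrink the weights,
    which helps when labelled examples are scarce.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(y=1)
        grad_w = X.T @ (p - y) / n + lam * w / n  # loss + L2 gradient
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit_penalised_logreg(X, y)
preds = (X @ w + b > 0).astype(float)
accuracy = np.mean(preds == y)
```

Because the classifier is a linear model over embedding dimensions, its weights can be traced back to input features, which is one route to the kind of word-level explanations the paper reports.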