DeepMind and UC Berkeley show how to make the most of LLM inference-time compute

With the high costs and slow speed of training large language models (LLMs), there is an ongoing discussion about the impact of spending more compute cycles on inference to improve LLM performance without retraining.

A recent study by researchers at DeepMind and the University of California, Berkeley explores ways to enhance LLM performance by strategically allocating compute resources during inference. Their findings, outlined in a new research paper, suggest that optimizing inference-time compute can yield significant performance gains without the need for larger models or extensive pre-training.

The tradeoff between inference-time and pre-training compute

The conventional method of enhancing LLM performance involves scaling up model size and pre-training compute. However, this approach has its limitations, as larger models are costly to train and require more resources to operate, making them challenging to deploy in various settings, including resource-constrained devices.

An alternative approach is to increase compute during inference to enhance the accuracy of LLM responses on challenging prompts. This method allows for the deployment of smaller LLMs while still achieving comparable performance to larger, more computationally expensive models.

The key question is: given a fixed amount of inference-time compute, which inference method yields the best performance from an LLM, and how does that performance compare to simply using a larger pre-trained model?

The most common method for scaling test-time computation is best-of-N sampling, in which the model generates N outputs in parallel and a verifier selects the highest-scoring one as the final answer. But there are other ways to spend inference-time compute, such as revising a response iteratively over sequential steps, or changing the verification mechanism used to pick the best response. Combining parallel and sequential sampling with different verification strategies and search algorithms yields a broad space of inference-time optimization strategies.
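
As an illustration, best-of-N sampling fits in a few lines of Python. This is a minimal sketch, not the paper's code: `generate` stands in for a stochastic LLM sampling call and `score` for a verifier such as a reward model.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # stand-in for a (stochastic) LLM sampling call
    score: Callable[[str, str], float],  # stand-in for a verifier / reward model
    n: int = 16,
) -> str:
    """Sample N candidate answers independently and return the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```

Because the N samples are independent, they can be generated in parallel; the verifier is the only component that compares them.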

Parallel vs sequential revision (source: arXiv)

To determine the optimal inference-time strategy, the researchers define a “test-time compute-optimal scaling strategy” as the strategy that, for a given prompt, chooses the hyperparameters of a test-time method so as to maximize performance at test time.
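
Paraphrasing the paper, this can be written as the hyperparameters θ that maximize expected accuracy on a prompt q under a compute budget N (the notation below follows the paper loosely and may differ in details):

```latex
\theta^{*}_{q, y^{*}(q)}(N)
  = \arg\max_{\theta}\,
    \mathbb{E}_{y \sim \mathrm{Target}(\theta, N, q)}
    \big[\, \mathbf{1}_{y = y^{*}(q)} \,\big]
```

Here Target(θ, N, q) is the distribution over outputs induced by running strategy θ with budget N on prompt q, and y*(q) is the ground-truth answer.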

“Ideally, test-time compute should adjust the distribution to generate better outputs than simply sampling from the LLM itself,” the researchers write.

Different ways to use inference-time compute

The researchers explored two key strategies for leveraging inference-time compute to enhance LLM performance. The first strategy involves modifying the proposal distribution, which is the process through which the LLM generates responses. This can be accomplished by fine-tuning the LLM to iteratively revise its answers for complex reasoning-based tasks.
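
A minimal sketch of this sequential-revision loop follows, assuming a hypothetical `revise` call that conditions the model on its earlier attempts (the paper fine-tunes models specifically for revision; both callables here are stand-ins):

```python
from typing import Callable, List

def sequential_revision(
    prompt: str,
    generate: Callable[[str], str],           # stand-in for the initial-answer call
    revise: Callable[[str, List[str]], str],  # stand-in for a revision call conditioned on prior attempts
    num_revisions: int = 4,
) -> List[str]:
    """Build a chain of answers, each one revising those before it."""
    answers = [generate(prompt)]
    for _ in range(num_revisions):
        answers.append(revise(prompt, answers))
    return answers  # a verifier can then pick the best answer in the chain
```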

The second strategy focuses on optimizing the verifier, the mechanism used to select the best answer from the generated responses. This optimization can involve training a process-based reward model to evaluate the correctness of individual steps in an answer.
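
A process-based reward model scores each intermediate step rather than only the final answer. The sketch below assumes a hypothetical per-step scorer `prm_score_step`; aggregating step scores by taking their minimum is one common choice, not necessarily the paper's:

```python
from typing import Callable, List

def score_solution(
    prompt: str,
    steps: List[str],
    prm_score_step: Callable[[str, List[str]], float],  # stand-in for a per-step PRM call
) -> float:
    """Aggregate per-step reward-model scores into one solution-level score.

    Taking the minimum means a single faulty step sinks the whole solution;
    alternatives include the product of step scores or the last step's score.
    """
    step_scores = [prm_score_step(prompt, steps[: i + 1]) for i in range(len(steps))]
    return min(step_scores)
```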

The researchers conducted experiments using both methods on the challenging MATH benchmark with PaLM-2 models to assess their effectiveness.

“Through both approaches, we found that the success of a specific test-time compute strategy is heavily dependent on the nature of the problem at hand and the base LLM being used,” noted the researchers.

For less complex problems where the base LLM can produce reasonable responses, refining the initial answer iteratively proved more effective than generating multiple parallel samples. On the other hand, for more challenging problems requiring exploration of diverse solution strategies, resampling multiple responses in parallel or utilizing tree-search with a process-based reward model demonstrated greater effectiveness.

Different answer verification strategies (source: arXiv)

This discovery emphasizes the importance of adopting an adaptive “compute-optimal” strategy for scaling test-time compute. The specific approach for utilizing test-time compute should be chosen based on the prompt to make optimal use of additional computation.
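
In code, such an adaptive controller might look like the sketch below. The difficulty estimator, the 0.5 threshold, and the two-way routing are illustrative assumptions; the paper bins prompts by estimated difficulty and picks the empirically best strategy per bin.

```python
def compute_optimal_answer(prompt, budget, estimate_difficulty,
                           sequential_revision, parallel_search):
    """Route a fixed compute budget to the strategy suited to the prompt.

    All three callables are hypothetical stand-ins: estimate_difficulty
    might use a verifier's predicted success rate, while the two strategies
    spend the same budget in different ways.
    """
    difficulty = estimate_difficulty(prompt)  # e.g., predicted probability of failure
    if difficulty < 0.5:
        # Easier prompts: spend the budget iteratively revising one answer.
        return sequential_revision(prompt, budget)
    # Harder prompts: spend it on parallel sampling or tree search against a verifier.
    return parallel_search(prompt, budget)
```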

By appropriately allocating test-time compute, the researchers significantly enhanced performance, surpassing the best-of-N baseline while using only approximately 25% of the computation.

Balancing test-time compute with pre-training compute

The researchers also explored whether test-time computation can serve as a substitute for additional pre-training. They compared the performance of a smaller model with additional test-time compute to a model 14 times larger with more pre-training.

For simpler and moderately difficult questions, the smaller model with extra test-time compute exhibited comparable performance to the larger pre-trained model.

“This finding suggests that, in certain scenarios, focusing on pretraining smaller models with less compute and then applying test-time compute can be more effective than solely scaling pretraining,” explained the researchers.

However, for the most challenging questions, additional pre-training compute proved more effective, indicating that current approaches to scaling test-time compute may not always be a complete substitute for scaling pre-training.

The researchers propose several future research directions, including exploring advanced strategies that combine various revision and search techniques and developing more efficient methods for estimating question difficulty.

“Overall, our study suggests that, even with a basic methodology, scaling up test-time computation can be more advantageous than scaling up pretraining. As test-time strategies evolve, further enhancements can be achieved,” concluded the researchers. “In the long run, this points towards a future where fewer FLOPs are spent during pretraining, and more FLOPs are allocated to inference.”

FAQs

How can inference-time compute improve the performance of large language models?

Inference-time compute can enhance the accuracy of responses generated by large language models on challenging prompts without the need for extensive retraining, leading to substantial performance gains.

What are some strategies for optimizing the use of inference-time compute?

Strategies include modifying the proposal distribution, optimizing the verifier, utilizing best-of-N sampling, sequential revision steps, and exploring various verification mechanisms and search algorithms.

Can test-time computation serve as a substitute for additional pre-training in large language models?

While additional test-time computation can be effective for simpler and moderately difficult questions, more challenging scenarios may still benefit from additional pre-training compute, highlighting the importance of balancing both aspects.

