Zyphra Technologies, the company working on a multimodal agent system combining advanced research in next-gen state-space model architectures, long-term memory, and reinforcement learning, just released Zyda-2, an open pretraining dataset comprising 5 trillion tokens.
While Zyda-2 is five times larger than its predecessor and covers a vast range of topics, what truly sets it apart is its unique composition. Unlike many open datasets available on Hugging Face, Zyda-2 has been distilled to retain the strengths of the top existing datasets while eliminating their weaknesses.
This gives organizations a way to train language models that achieve high accuracy for a given parameter budget, even when running on edge and consumer devices. The company trained its Zamba2 small language model on this dataset and found it performed significantly better than when trained on other state-of-the-art open language modeling datasets.
The move comes just a few months after the release of the original Zyda dataset, which covered a wide array of topics and domains to ensure the diversity and quality necessary for training competitive language models.
What does Zyda-2 bring to the table?
Earlier this year, as part of the effort to build highly powerful small models that could automate a range of tasks cheaply, Zyphra went beyond model architecture research to start constructing a custom pretraining dataset by combining the best permissively licensed open datasets – often recognized as high-quality within the community.
The first release from this work, Zyda, with 1.3 trillion tokens, debuted in June as a filtered and deduplicated mashup of existing premium open datasets, specifically RefinedWeb, StarCoder, C4, Pile, SlimPajama, peS2o, and arXiv.
At the time, Zyda outperformed the datasets it was built upon, giving enterprises a strong open option for training. But 1.3 trillion tokens was never going to be enough. The company needed to scale and push performance further, which led it to set up a new data processing pipeline and develop Zyda-2.
At its core, Zyphra built on Zyda-1, improving it further with open-source tokens from DCLM, FineWeb-Edu, and the Common Crawl portion of Dolma v1.7. The original version of Zyda was created with the company's own CPU-based processing pipeline, but for the latest version, it used Nvidia's NeMo Curator, a GPU-accelerated data curation library. This reduced the total cost of ownership by 2x and processed the data 10x faster, cutting the pipeline's runtime from three weeks to two days.
“We performed cross-deduplication between all datasets. We believe this increases quality per token since it removes duplicated documents from the dataset. Following on from that, we performed model-based quality filtering on Zyda-1 and Dolma-CC using NeMo Curator’s quality classifier, keeping only the ‘high-quality’ subset of these datasets,” Zyphra wrote in a blog post.
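To make the two stages Zyphra describes concrete, here is a minimal sketch of what cross-deduplication followed by quality filtering looks like in principle. This is an illustration, not Zyphra's pipeline or NeMo Curator's API: production systems like NeMo Curator use GPU-accelerated fuzzy matching and a trained quality classifier, whereas this toy version uses exact hashing of normalized text and takes an arbitrary scoring function (`score_fn` is a hypothetical stand-in for a real classifier).

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies hash alike."""
    return " ".join(text.lower().split())


def cross_deduplicate(datasets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep only the first occurrence of each document across ALL datasets.

    Documents are compared by a hash of their normalized text, so a duplicate
    appearing in a later dataset is dropped even if its whitespace or casing
    differs slightly.
    """
    seen: set[str] = set()
    deduped: dict[str, list[str]] = {}
    for name, docs in datasets.items():
        kept = []
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(doc)
        deduped[name] = kept
    return deduped


def quality_filter(docs: list[str], score_fn, threshold: float) -> list[str]:
    """Keep only documents whose quality score clears the threshold."""
    return [d for d in docs if score_fn(d) >= threshold]
```

The design point is that deduplication runs across all component datasets jointly (one shared `seen` set), not within each dataset separately; that is what "cross-deduplication" buys over naive concatenation.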
The work produced a complementary ensemble of datasets in Zyda-2, leading to improved model performance. As Nvidia noted in a separate developer blog post, the new dataset combines the best elements of the additional datasets used in the pipeline, with many high-quality educational samples for logical reasoning and factual knowledge. The Zyda-1 component, meanwhile, provides greater diversity and excels at more linguistic and writing tasks.
Distilled dataset leads to improved model performance
In an ablation study, training Zamba2-2.7B on Zyda-2 led to the highest aggregate evaluation score on leading benchmarks, including MMLU, HellaSwag, PIQA, Winogrande, ARC-Easy, and ARC-Challenge. This shows that model quality improves when training on the distilled dataset compared to training on the individual open datasets.
Zyda-2 performance
“While each component dataset has its strengths and weaknesses, the combined Zyda-2 dataset can fill these gaps. The total training budget to obtain a given model quality is reduced compared to the naive combination of these datasets through the use of deduplication and aggressive filtering,” the Nvidia blog added.
Ultimately, the company hopes this work will pave the way for better quality small models, helping enterprises maximize quality and efficiency with specific memory and latency constraints, both for on-device and cloud deployments.
Teams can already get started with the Zyda-2 dataset by downloading it directly from Hugging Face. It comes with an ODC-By license which enables users to train on or build off of Zyda-2 subject to the license agreements and terms of use of the original data sources.
Frequently Asked Questions
Q: What sets Zyda-2 apart from other datasets?
A: Zyda-2 is distilled from leading open datasets through cross-deduplication and model-based quality filtering, retaining their strengths while eliminating their weaknesses.
Q: How can organizations benefit from training with Zyda-2?
A: Organizations can achieve high accuracy in language models, operate effectively across different devices, and improve model performance compared to other open-source datasets.
Credit: venturebeat.com