arxiv:2601.02875

Revisiting Data Compression with Language Modeling

Published on Jan 6

Authors:

Abstract

Large language models demonstrate superior compression performance on text-dominant data while achieving state-of-the-art rates on the enwik9 dataset without additional training.

AI-generated summary

In this report, we investigate the potential use of large language models (LLM's) in the task of data compression. Previous works have demonstrated promising results in applying LLM's towards compressing not only text, but also a wide range of multi-modal data. Despite the favorable performance achieved, there still remains several practical questions that pose a challenge towards replacing existing data compression algorithms with LLM's. In this work, we explore different methods to achieve a lower adjusted compression rate using LLM's as data compressors. In comparison to previous works, we were able to achieve a new state-of-the-art (SOTA) adjusted compression rate of around 18% on the enwik9 dataset without additional model training. Furthermore, we explore the use of LLM's in compressing non-English data, code data, byte stream sequences. We show that while LLM's excel in compressing data in text-dominant domains, their ability in compressing non-natural text sequences still remain competitive if configured in the right way.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2601.02875

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.02875 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.02875 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.02875 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.