
Hardware once considered inadequate for powerful artificial intelligence (AI) models may soon be able to run them.

Huawei’s Computing Systems Lab has released SINQ (Sinkhorn-Normalized Quantization), an open-source technology that reduces the memory footprint of large language models (LLMs) by 60% to 70% while maintaining performance levels comparable to larger systems.

The breakthrough addresses one of AI’s most persistent challenges: the enormous computational resources required to run state-of-the-art language models.

The open-source nature of the release could accelerate adoption across the AI community, enabling researchers to experiment with larger models on standard workstations and allowing developers to build AI applications without prohibitive hardware constraints.

Models like GPT-3 typically demand high-end GPUs such as NVIDIA Corp.’s A100, which costs upwards of $19,000, or the even more expensive H100, creating substantial barriers for researchers, startups, and smaller companies.

SINQ employs two key innovations to achieve its compression without significant accuracy loss. The first, called Dual-Axis Scaling, applies separate scaling vectors to matrix rows and columns rather than using a single scaling factor. This approach distributes quantization errors more intelligently across the model.
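To make the idea concrete, here is a minimal NumPy sketch of dual-axis scaling applied to 4-bit quantization. It illustrates the concept rather than Huawei's implementation: the function name, the square-root choice of scales, and the symmetric 4-bit grid are assumptions made for the example.

```python
import numpy as np

def quantize_dual_axis(W, bits=4):
    """Illustrative dual-axis quantization: scale rows and columns
    separately before rounding to a small integer grid."""
    # Per-row and per-column scale vectors (a simple magnitude-based choice;
    # SINQ's actual scales are derived differently).
    row_scale = np.sqrt(np.abs(W).max(axis=1, keepdims=True)) + 1e-8
    col_scale = np.sqrt(np.abs(W).max(axis=0, keepdims=True)) + 1e-8
    W_norm = W / (row_scale * col_scale)        # normalize along both axes

    qmax = 2 ** (bits - 1) - 1                  # symmetric 4-bit grid: [-7, 7]
    step = np.abs(W_norm).max() / qmax
    Q = np.clip(np.round(W_norm / step), -qmax, qmax)

    # Dequantize: undo the rounding step and both scale vectors.
    W_hat = Q * step * row_scale * col_scale
    return Q.astype(np.int8), W_hat

W = np.random.randn(256, 512).astype(np.float32)
Q, W_hat = quantize_dual_axis(W)
print("mean absolute reconstruction error:", np.abs(W - W_hat).mean())
```

Because each weight is divided by both a row scale and a column scale, an outlier in one row or column no longer forces a coarse step size on the entire matrix.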

The second technique, Sinkhorn-Knopp-Style Normalization, uses a fast algorithm to address “matrix imbalance,” which researchers identified as a critical factor affecting quantization accuracy.
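The sketch below shows what a Sinkhorn-Knopp-style balancing loop can look like: rows and columns are alternately rescaled until their spreads even out, and the accumulated scale vectors are kept so the original matrix can be recovered. Using the standard deviation as the balancing statistic and a fixed ten iterations are assumptions for this illustration, not details taken from the paper.

```python
import numpy as np

def sinkhorn_normalize(W, iters=10, eps=1e-8):
    """Alternately rescale rows and columns of |W| so their spreads balance,
    accumulating scale vectors so that W == row_scale * W_norm * col_scale."""
    A = np.abs(W).astype(np.float64)
    row_scale = np.ones((A.shape[0], 1))
    col_scale = np.ones((1, A.shape[1]))
    for _ in range(iters):
        r = A.std(axis=1, keepdims=True) + eps   # per-row spread
        A /= r
        row_scale *= r
        c = A.std(axis=0, keepdims=True) + eps   # per-column spread
        A /= c
        col_scale *= c
    W_norm = np.sign(W) * A                      # restore signs
    return W_norm, row_scale, col_scale

# The balanced matrix is what gets quantized; the two scale vectors are stored
# alongside it and applied again at dequantization time.
W = np.random.randn(128, 64) * np.logspace(0, 2, 64)   # deliberately imbalanced columns
W_norm, rs, cs = sinkhorn_normalize(W)
assert np.allclose(W, rs * W_norm * cs)
```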

Together, these methods allow SINQ to outperform other calibration-free quantization techniques like Round-To-Nearest and HQQ, sometimes matching the performance of methods requiring extensive calibration data.
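For contrast, the round-to-nearest baseline mentioned above needs nothing more than a single scale per row (or per tensor) and a rounding step. The per-row variant below is illustrative, not any particular library's implementation.

```python
import numpy as np

def quantize_rtn(W, bits=4):
    """Plain round-to-nearest: one scale per row, no calibration data,
    no row/column balancing."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax + 1e-12
    Q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return Q, Q * scale            # quantized weights and their reconstruction
```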

The financial impact could be substantial. A model that previously required more than 60 GB of memory, and therefore enterprise GPU infrastructure, can now run in roughly 20 GB, making it feasible to operate on a single NVIDIA GeForce RTX 4090, which retails for approximately $1,600. According to Huawei, this represents potential hardware cost savings exceeding 90%.
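A back-of-the-envelope calculation shows how those figures line up; the 32-billion-parameter count and the overhead allowance for scale vectors below are illustrative assumptions, not numbers from Huawei's announcement.

```python
params = 32e9                          # an assumed ~32B-parameter model
fp16_gb = params * 2 / 1e9             # 16-bit weights, 2 bytes each  -> ~64 GB
int4_gb = params * 0.5 / 1e9           # 4-bit weights, 0.5 bytes each -> ~16 GB
int4_total_gb = int4_gb * 1.2          # rough allowance for scale vectors and metadata
print(f"FP16: {fp16_gb:.0f} GB, ~4-bit quantized: {int4_total_gb:.0f} GB")
```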

Cloud computing expenses could see similar reductions. While A100 instances typically cost $3 to $4.50 per hour, comparable performance on an RTX 4090 can be achieved for $1 to $1.50 per hour.

Huawei’s research team tested SINQ across multiple model architectures, including Qwen3, LLaMA, and DeepSeek. Benchmarks on WikiText2 and C4 datasets showed consistent improvements in perplexity and flip rates compared to baseline methods.

The technique also demonstrates speed advantages, quantizing models twice as fast as HQQ and more than 30 times faster than AWQ. SINQ supports non-uniform quantization schemes like NF4 and can be combined with calibration methods such as AWQ for additional accuracy gains.

Released under the Apache 2.0 license, SINQ is available on GitHub and through Hugging Face. The research team has indicated plans for tighter integration with Hugging Face Transformers and the eventual release of pre-quantized models on the Hugging Face Hub.
