Super Advanced LLM Stuff (IN PROGRESS)
bitsandbytes, 4-bit quantization and QLoRA
https://arxiv.org/abs/2212.09720
Quantization methods reduce the number of bits required to represent each parameter in a model, trading some accuracy for a smaller memory footprint and lower inference latency. However, the final model size depends on both the number of parameters in the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model take up the same number of bits but may have very different zero-shot accuracies.
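As a back-of-the-envelope sketch of that arithmetic (weights only; activations, KV cache and quantization metadata are ignored):

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Same total bit count, very different models:
print(weight_gb(30e9, 8))  # 30B at 8-bit -> 30.0 GB
print(weight_gb(60e9, 4))  # 60B at 4-bit -> 30.0 GB
```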
There are many different quantization methods; GGML, RTN and GPTQ are common examples. RTN is simple round-to-nearest. GPTQ, published in 2022 ("GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"), is more accurate but slower to apply.
So it is entirely possible that LLaMA quantized to 4 bits with RTN suffers substantial degradation while the same model quantized with GPTQ does not. In other words, "LLaMA with 4 bits" is not a complete specification: one also needs to specify the quantization method.
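For intuition, here is a minimal sketch of RTN with a single symmetric absmax scale per tensor (real implementations typically quantize per block or per channel and store the scales alongside the weights):

```python
import numpy as np

def rtn_quantize(w: np.ndarray, bits: int = 4):
    """Symmetric round-to-nearest quantization with one absmax scale."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed integers
    scale = np.abs(w).max() / qmax      # map the largest weight to qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def rtn_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = rtn_quantize(w, bits=4)
w_hat = rtn_dequantize(q, scale)
print(np.abs(w - w_hat).max())          # rounding error introduced by RTN
```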
The overhead of running these models is massive: the resources needed for inference are out of reach for all but the largest and best-funded organizations. Researchers and hobbyists who could fuel the next revolution in machine learning with access to these models are left out in the cold.
LLaMA promised to change all that, with options that perform as well as GPT-3 models but can run on as little as a single GPU.
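As a sketch of how loading a model in 4-bit with bitsandbytes looks through the transformers API (the model id is a placeholder; NF4 with double quantization matches the QLoRA defaults):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization with double quantization, as used by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "huggyllama/llama-7b"  # placeholder; any causal LM on the Hub works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU(s)
)

inputs = tokenizer("Quantization lets a 7B model fit in", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```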
4-bit vs 8-bit vs FP32
VRAM needed to run 7B, 13B and 33B (see the rough sketch below)
GPU vs CPU evolution
llama.cpp vs text-generation-webui vs koboldcpp
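A rough sketch of the weight-only VRAM arithmetic behind the first two items (the 1.2 overhead factor for activations and the KV cache is an assumption, not a measurement):

```python
def vram_gb(n_params: float, bits: float, overhead: float = 1.2) -> float:
    """Very rough VRAM estimate: weight bytes times a fudge factor."""
    return n_params * bits / 8 / 1e9 * overhead

for size_b in (7, 13, 33):
    est = {bits: vram_gb(size_b * 1e9, bits) for bits in (32, 8, 4)}
    print(f"{size_b}B  FP32: {est[32]:5.1f} GB   8-bit: {est[8]:4.1f} GB   4-bit: {est[4]:4.1f} GB")
```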