Towards the end of September 2022, Georgi Gerganov started work on the GGML library, a C library implementing tensor algebra. Gerganov designed the library with strict memory management and multi-threading as explicit goals. The creation of GGML was inspired by Fabrice Bellard's work on LibNC.[8]
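Those two goals are visible in the library's C API: the caller hands ggml a fixed-size memory arena up front, and the thread count is an explicit parameter of graph execution. The following is a minimal sketch written against one revision of the ggml API (function names and signatures have shifted between versions), intended only to illustrate the design, not to document the current interface.

#include "ggml.h"

int main(void) {
    // Strict memory management: every tensor and graph node is carved
    // out of one caller-sized arena; nothing is malloc'd per operation.
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,  // 16 MiB arena
        /*.mem_buffer =*/ NULL,              // let ggml allocate the arena itself
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Build a tiny compute graph: c = matmul(a, b).
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);
    ggml_set_f32(a, 1.0f);
    ggml_set_f32(b, 2.0f);
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // Multi-threading: the caller chooses the number of worker threads.
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);

    ggml_free(ctx);  // releases the whole arena at once
    return 0;
}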
Before llama.cpp, Gerganov worked on a similar library called whisper.cpp, which implemented Whisper, a speech-to-text model by OpenAI.[9]
Development
Georgi Gerganov began developing llama.cpp in March 2023 as an implementation of the Llama inference code in pure C/C++ with no dependencies. Improved performance on computers without a GPU or other dedicated hardware was a goal of the project.[3][10][11] llama.cpp gained traction with users who lacked specialized hardware, as it could run on a CPU alone, including on Android devices.[10][12][13] While initially designed for CPUs, GPU inference support was added later.[14] As of August 2025, it has more than 85,000 stars on GitHub.[15]
In March 2024, Justine Tunney introduced new optimized matrix multiplication kernels for x86 and ARM CPUs, improving prompt-evaluation performance for FP16 and 8-bit quantized data types; these improvements were committed upstream to llama.cpp.[16] Tunney also created a tool called llamafile that bundles models and llama.cpp into a single file that runs on multiple operating systems. It does so via Cosmopolitan Libc, a library also created by Tunney that makes C/C++ programs portable across operating systems.[16]
GGUF file format
The GGUF (GGML Universal File)[26] format is a binary file format that stores both tensors and metadata in a single file, and is designed for fast saving and loading of model data.[27] It was introduced in August 2023 by the llama.cpp project to better maintain backwards compatibility as support was added for other model architectures.[14][28] It superseded the project's earlier formats, such as GGML.
GGUF files are typically created by converting models developed with a different machine learning library such as PyTorch.[27]
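The llama.cpp repository ships Python conversion scripts for this purpose; in recent versions the main entry point is convert_hf_to_gguf.py (script names and flags have changed over time, so treat this invocation as indicative rather than definitive):

python convert_hf_to_gguf.py ./path/to/hf-model --outfile model-f16.gguf --outtype f16

The resulting .gguf file can then be quantized to a lower-precision type with the project's quantize tool, or loaded directly by llama.cpp.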
Design
The format focuses on quantization, the reduction of the numerical precision of the model weights. Quantization can reduce memory usage and increase inference speed, at the expense of model accuracy.[29][28]
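The memory arithmetic is direct: a 7-billion-parameter model stored as 16-bit floats needs roughly 14 GB for its weights, while an 8-bit quantization of the same weights needs about 7 GB. To illustrate where the accuracy cost comes from, here is a generic round-to-nearest 8-bit quantizer in C. This is a sketch of the general technique only; GGML's actual quantized types are block-based formats that store per-block scaling factors alongside the packed weights.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Generic round-to-nearest 8-bit quantization (not ggml's actual formats):
   one float scale is stored per group of weights plus one int8 per weight,
   cutting memory to about a quarter of float32 at some cost in accuracy. */
static void quantize_q8(const float *w, int n, int8_t *q, float *scale) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(w[i]);
        if (a > amax) amax = a;
    }
    *scale = (amax == 0.0f) ? 1.0f : amax / 127.0f;  // map [-amax, amax] to [-127, 127]
    for (int i = 0; i < n; i++) {
        q[i] = (int8_t) roundf(w[i] / *scale);
    }
}

static float dequantize_q8(int8_t q, float scale) {
    return q * scale;  // reconstruction is approximate: the rounding error remains
}

int main(void) {
    const float w[4] = {0.10f, -0.52f, 0.33f, 0.91f};
    int8_t q[4];
    float scale;
    quantize_q8(w, 4, q, &scale);
    for (int i = 0; i < 4; i++) {
        printf("%+.4f -> %4d -> %+.4f\n", w[i], q[i], dequantize_q8(q[i], scale));
    }
    return 0;
}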
GGUF supports 2-bit to 8-bit quantized integer types;[30] common floating-point data formats such as float32, float16, and bfloat16; and 1.56-bit quantization.[5]
The file format contains the information necessary for running a GPT-like language model, such as the tokenizer vocabulary, context length, tensor info, and other attributes.[31]
A GGUF file begins with a fixed-size header, followed by three variable-size blocks: a metadata block, a tensors info block containing tensor_count entries, and a uint8_t tensor_data[] block containing the packed weight bits.

Metadata block
// example metadata
general.architecture: 'llama',
general.name: 'LLaMA v2',
llama.context_length: 4096,
...,
general.file_type: 10,  // typically indicates quantization level, here "MOSTLY_Q2_K"
tokenizer.ggml.model: 'llama',
tokenizer.ggml.tokens: ['<unk>', '<s>', '</s>', '<0x00>', '<0x01>', '<0x02>', '<0x03>', '<0x04>', '<0x05>', '<0x06>', '<0x07>', '<0x08>', ...],
...
Tensors info block
// n-th tensor
name: GGUF string,     // ex: "blk.0.ffn_gate.weight"
n_dimensions: UINT32,  // ex: 2
dimensions: UINT64[],  // ex: [ 4096, 32000 ]
type: UINT32,          // ex: 10 (typically indicates quantization level, here "GGML_TYPE_Q2_K")
offset: UINT64         // starting position within the tensor_data block, relative to the start of the block

// (n+1)-th tensor
...
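Ahead of these blocks sits a small fixed header: the 4-byte magic "GGUF", a uint32 format version, and (from format version 2 onward) uint64 tensor_count and metadata_kv_count fields, all little-endian. A minimal C sketch that reads just this header:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Reads the fixed-size GGUF header that precedes the metadata,
// tensors info, and tensor_data blocks (little-endian, version >= 2).
int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    char     magic[4];
    uint32_t version;
    uint64_t tensor_count, metadata_kv_count;

    if (fread(magic, 1, sizeof magic, f) != sizeof magic ||
        memcmp(magic, "GGUF", 4) != 0 ||
        fread(&version, sizeof version, 1, f) != 1 ||
        fread(&tensor_count, sizeof tensor_count, 1, f) != 1 ||
        fread(&metadata_kv_count, sizeof metadata_kv_count, 1, f) != 1) {
        fprintf(stderr, "not a readable GGUF file\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    // metadata_kv_count key-value pairs follow the header, then
    // tensor_count tensor info records, then the tensor_data block.
    printf("GGUF v%u: %llu tensors, %llu metadata key-value pairs\n",
           (unsigned) version,
           (unsigned long long) tensor_count,
           (unsigned long long) metadata_kv_count);
    return 0;
}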
Rajput, Saurabhsingh; Sharma, Tushar (4 June 2024). "Benchmarking Emerging Deep Learning Quantization Methods for Energy Efficiency". 2024 IEEE 21st International Conference on Software Architecture Companion (ICSA-C). pp. 238–242. doi:10.1109/ICSA-C63560.2024.00049. ISBN 979-8-3503-6625-9.
Gerganov, Georgi; Nguyen, Xuan Son; Slaren (13 August 2024). "Introduction to ggml". Hugging Face.
Kluska, Piotr; Castelló, Adrián; Scheidegger, Florian; Malossi, A. Cristiano I.; Quintana-Ortí, Enrique (June 2024). "QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers" (PDF). Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
Zhang, Jianyu; Meng, Hengyu; Hu, Ying; Luo, Yu; Duan, Xiaoping; Majumder, Abhilash (Intel Corporation) (July 2024). "Run LLMs on Intel GPUs Using llama.cpp". The Parallel Universe. No. 57. Intel. pp. 34–37.
Cabezas, Darío; Fonseca-Delgado, Rigoberto; Reyes-Chacón, Iván; Vizcaino-Imacaña, Paulina; Morocho-Cayamcela, Manuel (2024). "Integrating a LLaMa-based Chatbot with Augmented Retrieval Generation as a Complementary Educational Tool for High School and College Students". Proceedings of the 19th International Conference on Software Technologies. pp. 395–402. doi:10.5220/0012763000003753. ISBN 978-989-758-706-1.
Dong, Bo; Lin, Jun; Yu, Zhentao; Xu, Zhenzhong; Luo, Yu; Chang, Hanwen; Shen, Haihao (July 2024). "Accelerating GGUF Models with Transformers". The Parallel Universe. No. 57. Intel. pp. 28–33.