Inference for text-embeddings in Python
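The entry does not name the underlying library, so as a minimal illustration of text-embedding inference in Python, here is a sketch using sentence-transformers (an assumption; the original project may use a different stack):

```python
from sentence_transformers import SentenceTransformer

# Load a small, widely used embedding model (assumption: any
# sentence-transformers checkpoint works the same way).
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode a batch of texts into dense vectors (numpy array,
# shape: [num_texts, embedding_dim]).
embeddings = model.encode(["LLM inference is memory-bound.",
                           "Embeddings map text to vectors."])
print(embeddings.shape)
```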
An open-source GPU cluster manager for running LLMs
NVIDIA Framework for LLM Inference
NVIDIA Framework for LLM Inference (transitioned to TensorRT-LLM)
Get up and running with Llama 3, Mistral, Gemma, and other large language models.
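This describes Ollama. As a minimal sketch of using it from Python, assuming a local Ollama server on its default port (11434) and a model already pulled with `ollama pull llama3`:

```python
import requests

# Call Ollama's local REST API; stream=False returns one JSON object
# instead of a stream of partial responses.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
)
resp.raise_for_status()
print(resp.json()["response"])
```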
A method for improving the efficiency of Transformer models
FlexLLMGen is a high-throughput generation engine for running large language models with limited GPU memory. It achieves this through IO-efficient offloading, compression, and large effective batch sizes.
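To illustrate the offloading idea (this is not FlexLLMGen's actual API, just a toy sketch of the technique): keep layer weights in CPU RAM and stream each layer to the GPU only while it is computing, trading PCIe transfer time for GPU memory.

```python
import torch

# Toy "model": a stack of layers that lives in CPU RAM (requires CUDA).
layers = [torch.nn.Linear(4096, 4096) for _ in range(8)]

def forward_offloaded(x: torch.Tensor) -> torch.Tensor:
    for layer in layers:
        layer.to("cuda")           # stream this layer's weights onto the GPU
        x = layer(x.to("cuda"))    # compute on the GPU
        layer.to("cpu")            # evict to make room for the next layer
    return x

# A large effective batch amortizes the per-layer transfer cost, which is
# the core of a throughput-oriented offloading design like FlexLLMGen's.
out = forward_offloaded(torch.randn(64, 4096))
print(out.shape)
```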