A playground for developers to fine-tune and deploy LLMs.
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
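The project itself is not named here, so as a general illustration of the same idea (running Llama with quantized weights to cut memory use), here is a minimal sketch using HF transformers' documented 4-bit loading via bitsandbytes; the model id is an assumption, and this is a related but distinct approach, not this project's own API:

```python
# Sketch: load a Llama checkpoint with 4-bit quantized weights (bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # assumed model id, for illustration only
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,   # compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```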
Formerly langchain-ChatGLM: a local-knowledge-based question-answering app built with LangChain, using LLMs such as ChatGLM.
Run LLMs and batch jobs on any cloud. Get maximum cost savings, the highest GPU availability, and managed execution, all through a simple interface.
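This blurb matches SkyPilot's tagline; assuming that is the project, a minimal launch sketch with its documented Python API follows (the script name and cluster name are hypothetical):

```python
# Sketch: launch a GPU job on whichever cloud is cheapest/available (SkyPilot).
import sky

task = sky.Task(run="python serve_llm.py")  # serve_llm.py is a hypothetical entry point
task.set_resources(sky.Resources(accelerators="A100:1"))

sky.launch(task, cluster_name="llm-serve")  # provisions a cluster and runs the task
```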
A high-throughput and memory-efficient inference and serving engine for LLMs.
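This description matches vLLM's tagline; assuming that is the project, a minimal usage sketch with its documented Python API:

```python
# Sketch: offline batched generation with vLLM.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # any HF-compatible model id
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```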
A method designed to improve the efficiency of Transformer models.
FlexLLMGen is a high-throughput generation engine for running large language models with limited GPU memory. It achieves this throughput through IO-efficient offloading, compression, and large effective batch sizes.
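To make the offloading idea concrete, here is a minimal plain-PyTorch sketch, not FlexLLMGen's actual API: weights live in CPU RAM and each layer is streamed to the GPU just before it runs, with a large batch amortizing the transfer cost (FlexLLMGen additionally compresses the offloaded tensors):

```python
# Sketch: IO-efficient weight offloading for memory-limited generation.
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(8)])  # stand-in for transformer blocks
device = "cuda" if torch.cuda.is_available() else "cpu"

def offloaded_forward(x: torch.Tensor) -> torch.Tensor:
    x = x.to(device)
    for layer in layers:      # weights stay on CPU between calls
        layer.to(device)      # stream this layer's weights in
        x = layer(x)
        layer.to("cpu")       # free GPU memory for the next layer
    return x

# A large effective batch amortizes the per-layer transfer cost.
out = offloaded_forward(torch.randn(512, 4096))
```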