WebAssembly binding for llama.cpp, enabling in-browser LLM inference.
A high-throughput and memory-efficient inference and serving engine for LLMs.
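This description is vLLM's tagline; assuming the entry refers to vLLM, here is a minimal offline-inference sketch using its Python API. The model name and sampling values are just examples:

```python
from vllm import LLM, SamplingParams

# Load a small model for illustration; any supported Hugging Face model ID works.
llm = LLM(model="facebook/opt-125m")

# Sampling parameters control decoding; these values are arbitrary examples.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "The capital of France is",
    "Explain paged attention in one sentence:",
]

# generate() batches the prompts and returns one RequestOutput per prompt.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```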
FlexLLMGen is a high-throughput generation engine for running large language models with limited GPU memory. It achieves this through IO-efficient offloading, compression, and large effective batch sizes.
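To make the offloading idea concrete, here is a conceptual PyTorch sketch (not FlexLLMGen's actual code): weights live on the CPU and are moved to the GPU only while their layer runs, so a model larger than GPU memory can still execute.

```python
import torch
import torch.nn as nn

# Conceptual sketch of weight offloading. FlexLLMGen's real scheduler also
# overlaps transfers with compute and compresses the offloaded state.
device = "cuda" if torch.cuda.is_available() else "cpu"

layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(8)])  # stays on CPU

def forward_with_offload(x: torch.Tensor) -> torch.Tensor:
    x = x.to(device)
    for layer in layers:
        layer.to(device)   # fetch weights for the active layer
        x = layer(x)
        layer.to("cpu")    # evict to free GPU memory for the next layer
    return x

out = forward_with_offload(torch.randn(16, 4096))  # batch of 16 as an example
```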
From LeptonAI: build your own conversational search engine in under 500 lines of code.
Gateway streamlines requests to 100+ open- and closed-source models through a unified API. It is production-ready, with support for caching, fallbacks, retries, timeouts, and load balancing, and can be edge-deployed for minimal latency.
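A hedged sketch of what calling such a gateway can look like through an OpenAI-compatible client; the base URL, port, and routing header below are assumptions for illustration, not confirmed defaults of this project:

```python
from openai import OpenAI

# Assumption: the gateway runs locally and exposes an OpenAI-compatible
# /v1/chat/completions endpoint; URL, port, and header name are illustrative.
client = OpenAI(
    base_url="http://localhost:8787/v1",  # hypothetical local gateway address
    api_key="YOUR_PROVIDER_KEY",
    default_headers={"x-gateway-provider": "openai"},  # hypothetical routing header
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # the gateway forwards this to the selected provider
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(resp.choices[0].message.content)
```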
Seamlessly integrate LLMs as Python functions.
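The pattern behind this: a decorator turns a typed function signature plus a prompt template into an LLM call, so the function body is supplied by the model. The sketch below is hypothetical (stubbed model call, invented names) and only illustrates the idea:

```python
import inspect
from functools import wraps
from typing import Callable

def llm_complete(prompt_text: str) -> str:
    """Stub standing in for a real model call; swap in any client here."""
    return f"[model output for: {prompt_text!r}]"

def prompt(template: str) -> Callable:
    """Hypothetical decorator: bind a prompt template to a function signature."""
    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(*args, **kwargs):
            # Map the call's arguments onto the template's placeholders.
            bound = inspect.signature(fn).bind(*args, **kwargs)
            bound.apply_defaults()
            return llm_complete(template.format(**bound.arguments))
        return wrapper
    return decorator

@prompt("Translate to French: {text}")
def translate(text: str) -> str: ...

print(translate("good morning"))  # the decorator supplies the body
```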
Formerly langchain-ChatGLM: a question-answering app over a local knowledge base, built with langchain and local LLMs such as ChatGLM.