WebAssembly binding for llama.cpp - Enabling in-browser LLM inference
NVIDIA framework for LLM inference (transitioned to TensorRT-LLM)
A suite of observability tools for evaluating, testing, and shipping LLM applications, and for calibrating language model outputs across the development and production lifecycle.
A chat interface built on llama.cpp for running Alpaca models. No API keys, entirely self-hosted!
FlexLLMGen is a high-throughput generation engine for running large language models with limited GPU memory. It achieves high throughput through IO-efficient offloading, compression, and large effective batch sizes.
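To make the offloading idea concrete, here is a minimal sketch (my own illustration, not FlexLLMGen's actual API; the model, layer count, and dimensions are invented): weights stay in CPU RAM, and each layer is streamed to the GPU just long enough to process one large batch, so the per-layer transfer cost is amortized across many sequences.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

hidden = 2048
# The "model": a stack of large linear layers whose weights live in CPU RAM.
layers = [nn.Linear(hidden, hidden) for _ in range(8)]

@torch.no_grad()
def forward_offloaded(x: torch.Tensor) -> torch.Tensor:
    """One forward pass where only a single layer's weights occupy the
    GPU at any time; the activations for the whole batch stay resident."""
    x = x.to(device)
    for layer in layers:
        layer.to(device)         # stream this layer's weights to the GPU
        x = torch.relu(layer(x))
        layer.to("cpu")          # evict weights; real systems overlap this
                                 # transfer with compute and compress weights
    return x

# A large batch pays each layer's transfer cost once for many sequences,
# which is the "large effective batch size" idea.
batch = torch.randn(512, hidden)
print(forward_offloaded(batch).shape)  # torch.Size([512, 2048])
```

On a CPU-only machine the `.to` calls are no-ops and the sketch still runs, which makes the trade-off easy to experiment with: a bigger batch raises throughput until activations themselves exhaust GPU memory.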
An open-source GPU cluster manager for running LLMs
NanoFlow is a throughput-oriented, high-performance serving framework for LLMs that consistently delivers higher throughput than vLLM, DeepSpeed-FastGen, and TensorRT-LLM.
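Throughput claims like this are typically stated in generated tokens per second under concurrent load. Below is a minimal probe of that metric (my own sketch, not NanoFlow's benchmark harness), assuming the server under test exposes an OpenAI-compatible /v1/completions endpoint, as vLLM and many serving frameworks do; the URL, model name, and prompts are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"  # assumed endpoint
MODEL = "my-model"                            # placeholder model name

def one_request(prompt: str) -> int:
    resp = requests.post(URL, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 128,
    }, timeout=300).json()
    # OpenAI-compatible servers report token counts in the `usage` field.
    return resp["usage"]["completion_tokens"]

prompts = ["Summarize the history of computing."] * 64

start = time.time()
# Issue requests concurrently: serving throughput only shows up under load,
# since batching across in-flight requests is where these engines differ.
with ThreadPoolExecutor(max_workers=16) as pool:
    tokens = sum(pool.map(one_request, prompts))

print(f"{tokens / (time.time() - start):.1f} generated tokens/s")
```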