A method designed to enhance the efficiency of Transformer models
NVIDIA's framework for LLM inference.
Build your own conversational search engine in under 500 lines of code, by LeptonAI.
A high-throughput and memory-efficient inference and serving engine for LLMs.
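For a sense of what serving with vLLM looks like, here is a minimal offline-inference sketch modeled on its quickstart; the model name and sampling settings are illustrative, not prescribed:

```python
from vllm import LLM, SamplingParams

# Load a model (any Hugging Face model id vLLM supports can be used here).
llm = LLM(model="facebook/opt-125m")

# Sampling settings are illustrative.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches prompts internally for high-throughput inference.
outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```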
FlexLLMGen is a high-throughput generation engine for running large language models with limited GPU memory. It achieves high throughput through IO-efficient offloading, compression, and large effective batch sizes.
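To illustrate the offloading idea in the abstract (a conceptual sketch, not FlexLLMGen's actual API): weights stay in CPU memory and are streamed to the GPU one layer at a time, trading IO for the ability to run models larger than GPU memory.

```python
import torch

def offloaded_forward(layers, hidden, device="cuda"):
    """Conceptual layer-by-layer offloading: stream each layer's weights
    to the GPU only while that layer computes, then evict them to CPU.
    `layers` is any sequence of torch.nn.Module kept in CPU memory;
    `hidden` is assumed to already reside on `device`."""
    for layer in layers:
        layer.to(device, non_blocking=True)  # stream weights onto the GPU
        hidden = layer(hidden)               # compute this layer on the GPU
        layer.to("cpu")                      # evict weights to free GPU memory
    return hidden
```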
An interactive chat project that leverages Ollama, OpenAI, or MistralAI LLMs to quickly understand and navigate GitHub code repositories or compressed file archives.
Use ChatGPT on WeChat via wechaty.