Inference for text embeddings in Python
Data integration platform for LLMs.
FlexLLMGen is a high-throughput generation engine for running large language models with limited GPU memory. It achieves high throughput through IO-efficient offloading, compression, and large effective batch sizes.
WebAssembly binding for llama.cpp - Enabling in-browser LLM inference
Use ChatGPT on WeChat via Wechaty
A high-throughput and low-latency inference and serving framework for LLMs and VLMs
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.