Show HN: Tiny-vLLM, a high-performance LLM inference engine implemented in C++ and CUDA
## Summary
`tiny-vllm` is a lightweight LLM inference engine written in C++ and CUDA, built as a smaller sibling of `vLLM`. It is meant to serve both as a practical inference server and as an educational resource: the complete source code is provided, and the only external dependencies are a few Linux tools and the header-only `nlohmann/json` library, which is used to parse Safetensors files. The project targets the `Llama 3.2 1B Instruct` model (commit `898999bd...`) and loads its weights as `bfloat16` (`__nv_bfloat16` in CUDA), which halves memory use relative to `float32` while keeping the same exponent range. It was developed on Linux kernel 6.19.8 with CUDA Toolkit 13.1 and GCC 15.2.1, on an AMD Ryzen 7 CPU and an NVIDIA RTX 5090 GPU. The Safetensors files it reads consist of three parts: an 8-byte header size, a JSON header describing each tensor's metadata, and the raw tensor data.
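The Safetensors layout described above can be sketched in a few lines of standard C++. This is a minimal illustration, not the project's loader: `SafetensorsView`, `read_u64_le`, and `split_safetensors` are hypothetical names, the header size is decoded as an unsigned 64-bit little-endian integer as the Safetensors format specifies, and the JSON text is only extracted, not parsed (the project hands that step to `nlohmann/json`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper type: the three parts of a Safetensors file.
struct SafetensorsView {
    std::uint64_t header_size;  // length of the JSON header in bytes
    std::string header_json;    // raw JSON text, ready for a JSON parser
    std::size_t data_offset;    // byte offset where raw tensor data begins
};

// Decode an unsigned 64-bit little-endian integer byte by byte,
// so the result does not depend on the host's endianness.
inline std::uint64_t read_u64_le(const std::uint8_t* p) {
    assert(p != nullptr);
    std::uint64_t value = 0;
    for (int i = 0; i < 8; ++i) {
        value |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    }
    return value;
}

// Split an in-memory Safetensors file into header size, JSON header,
// and the offset of the raw data region. Throws on truncated input.
inline SafetensorsView split_safetensors(const std::vector<std::uint8_t>& file) {
    if (file.size() < 8) {
        throw std::runtime_error("file too small to contain the header size");
    }
    const std::uint64_t n = read_u64_le(file.data());
    if (n > file.size() - 8) {
        throw std::runtime_error("header size exceeds file size");
    }
    SafetensorsView view;
    view.header_size = n;
    view.header_json.assign(reinterpret_cast<const char*>(file.data()) + 8,
                            static_cast<std::size_t>(n));
    view.data_offset = 8 + static_cast<std::size_t>(n);
    return view;
}
```

Each tensor entry in the JSON header carries `data_offsets` that are relative to the start of the data region, so a tensor's bytes would begin at `view.data_offset` plus its first offset.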
Inference follows this sequence: tokenization, embedding lookup, RMSNorm, residual connections, RoPE positional embeddings, attention (GQA), SiLU activation, and finally the feed-forward network (MLP). A key technical challenge is CUDA's limit of 1024 threads per block: the embedding dimension is 2048, so kernels cannot assign one thread to each element and often have each thread process multiple elements. For matrix multiplication, the engine uses `cublasGemmEx` with a transposition trick (`CUBLAS_OP_T`, `CUBLAS_OP_N`) so that cuBLAS, which assumes column-major storage, works efficiently on row-major data. The `KV cache` stores the Key/Value projections of earlier tokens so they are not recomputed at every decoding step. For batched processing, the engine supports both static batching and continuous batching, with `PagedAttention` managing the KV cache. Ultimately, the project is intended as a "just-in-time" learning tool: developers pick up the necessary linear algebra and CUDA concepts as they encounter them in the code.
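The transposition trick for row-major data can be illustrated on the CPU. The sketch below is not the project's code and does not call cuBLAS: `gemm_col_major` is a hypothetical stand-in that mimics the column-major semantics of a BLAS-style GEMM (with alpha = 1 and beta = 0), and `linear_row_major` assumes the weight matrix is stored row-major as `[out_features, in_features]`, which is an assumption about the layout rather than something stated above. A row-major buffer read as column-major is the transpose of the original matrix, so computing `y = x * W^T` in row-major terms becomes `y^T = W * x^T` in column-major terms, which is a GEMM with the first operand transposed and the second not.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for a column-major GEMM with BLAS-style semantics
// (alpha = 1, beta = 0): C = op(A) * op(B), where op(A) is m x k,
// op(B) is k x n, and C is m x n, all addressed column-major.
inline void gemm_col_major(bool trans_a, bool trans_b, int m, int n, int k,
                           const float* A, int lda,
                           const float* B, int ldb,
                           float* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int l = 0; l < k; ++l) {
                const float a = trans_a ? A[l + i * lda] : A[i + l * lda];
                const float b = trans_b ? B[j + l * ldb] : B[l + j * ldb];
                acc += a * b;
            }
            C[i + j * ldc] = acc;
        }
    }
}

// y = x * W^T with every buffer stored row-major:
//   x: [tokens, in_features]
//   W: [out_features, in_features]  (assumed layout)
//   y: [tokens, out_features]
inline std::vector<float> linear_row_major(const std::vector<float>& x,
                                           const std::vector<float>& W,
                                           int tokens, int in_features,
                                           int out_features) {
    assert(x.size() == static_cast<std::size_t>(tokens) * in_features);
    assert(W.size() == static_cast<std::size_t>(out_features) * in_features);
    std::vector<float> y(static_cast<std::size_t>(tokens) * out_features);
    // Column-major view: W's buffer is an in x out matrix, so it is
    // transposed (the OP_T role); x's buffer is an in x tokens matrix
    // and is used as is (the OP_N role). The out x tokens result,
    // written column-major, is exactly the row-major y.
    gemm_col_major(/*trans_a=*/true, /*trans_b=*/false,
                   out_features, tokens, in_features,
                   W.data(), in_features,
                   x.data(), in_features,
                   y.data(), out_features);
    return y;
}
```

With two tokens `x = {1,2,3, 4,5,6}` and weights `W = {1,0,1, 0,1,0}`, calling `linear_row_major(x, W, 2, 3, 2)` yields `{4, 2, 10, 5}`, the same as multiplying the row-major matrices by hand. No data is copied or physically transposed, which is the point of the trick.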