The Business Problem: Cost & Complexity
Foundation LLMs demand heavy computation, large memory capacity, substantial storage, and high memory bandwidth. These demands translate directly into high cloud computing costs and a significant environmental footprint. QSL addresses both by optimizing the entire stack, from model preparation to hardware execution, to deliver high-performance, low-cost inference.
QSL Solution: The GenAI Execution Engine
Our solution is an end-to-end (E2E) pipeline covering model preparation and an optimized runtime, targeting efficiency from the moment a model is pulled from Hugging Face to its execution on hardware.
1. Preparation & Optimization Pipeline
- **Advanced Quantization:** We apply Post-Training Quantization (PTQ) algorithms such as GPTQ, alongside Quantization-Aware Training (QAT) and Knowledge Distillation-QAT (KD-QAT), to reduce memory footprint and computational cost with minimal accuracy loss.
- **Model Compression:** Low-rank adaptation methods such as LoRA and QLoRA (LoRA applied on top of a quantized base model) are integrated into our framework.
- **Graph Optimization:** Using torch.fx and ONNX graph representations, we apply techniques such as Squash and Split to streamline the model graph.
- **Memory Efficiency:** We lower per-GPU memory requirements through **Model Sharding**, using frameworks such as DeepSpeed and TorchAO.
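
To make the quantization step above concrete, here is a minimal sketch of the simplest form of PTQ: symmetric round-to-nearest quantization with one scale per weight row. It is not GPTQ and not QSL's implementation; the function names (`quantize_per_row`, `dequantize`, `weight_bytes`) are hypothetical and chosen only for illustration. It uses plain Python lists so it runs without any dependencies.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_row(weights: Matrix, bits: int = 8) -> Tuple[List[List[int]], List[float]]:
    """Symmetric round-to-nearest quantization, one scale per row (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1  # largest positive code, e.g. 127 for 8 bits
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # An all-zero row gets a dummy scale of 1.0 to avoid dividing by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round to the nearest code and clamp into [-qmax, qmax].
        q_rows.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[int]], scales: List[float]) -> Matrix:
    """Map integer codes back to approximate floating-point weights."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def weight_bytes(rows: int, cols: int, bits: int) -> int:
    """Storage for a rows x cols weight matrix at the given bit width (scales excluded)."""
    return (rows * cols * bits + 7) // 8


if __name__ == "__main__":
    w = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
    q, s = quantize_per_row(w, bits=8)
    print("codes:", q)
    print("reconstructed:", dequantize(q, s))
    print("fp16 bytes:", weight_bytes(4096, 4096, 16))
    print("int4 bytes:", weight_bytes(4096, 4096, 4))
```

The per-element reconstruction error of this scheme is bounded by half a quantization step (half the row's scale). Methods such as GPTQ go further by choosing the rounding to minimize layer output error rather than rounding each weight independently.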
2. Runtime & Hardware Acceleration
- **Custom Execution Providers:** We deploy a highly optimized, configurable computation engine written in **Chisel 3**, a hardware construction language embedded in Scala. The engine generates scalable designs parameterized by MAC array size and bit width.
- **Systolic Array Architecture:** Our multiplication engine (QL MPE) is based on a highly optimized systolic array architecture, designed for high throughput on massively parallel matrix operations.
- **Hardware-Agnostic Deployment:** Our engine deploys across FPGA vendors, supporting AMD (Xilinx) and Intel (Altera) devices.
- **E2E Deployment:** We provide a compilation and evaluation pipeline that integrates with FPGA vendor tools such as Vivado, giving a direct path from a Hugging Face repository to accelerated FPGA execution.
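
As an illustration of how a systolic array computes a matrix product, the following is a small cycle-level software model. It is a sketch under stated assumptions, not the QL MPE design: the source does not describe the engine's dataflow, so an output-stationary arrangement is assumed here, and the name `systolic_matmul` and its `bit_width` parameter are hypothetical. The array size follows the operand shapes, and `bit_width` limits the signed operand range, mirroring the two parameters mentioned above.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def systolic_matmul(a: Matrix, b: Matrix, bit_width: int = 8) -> Tuple[Matrix, int]:
    """Model an output-stationary systolic array computing a @ b.

    PE (i, j) keeps the accumulator for C[i][j]. Values of `a` enter from the
    left and move right one PE per cycle; values of `b` enter from the top and
    move down one PE per cycle. Row i and column j inputs are skewed by i and
    j cycles so that matching operands meet in the right PE.
    """
    rows, depth, cols = len(a), len(b), len(b[0])
    lo, hi = -(1 << (bit_width - 1)), (1 << (bit_width - 1)) - 1
    for matrix in (a, b):
        for row in matrix:
            for v in row:
                if not lo <= v <= hi:
                    raise ValueError(f"{v} does not fit in {bit_width} signed bits")

    a_reg = [[0] * cols for _ in range(rows)]  # operand registers moving right
    b_reg = [[0] * cols for _ in range(rows)]  # operand registers moving down
    acc = [[0] * cols for _ in range(rows)]    # stationary accumulators

    cycles = depth + rows + cols - 2           # last PE finishes after the skew drains
    for t in range(cycles):
        # Shift a-values right; feed column 0 with the skewed input stream.
        for i in range(rows):
            for j in range(cols - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = a[i][k] if 0 <= k < depth else 0
        # Shift b-values down; feed row 0 with the skewed input stream.
        for j in range(cols):
            for i in range(rows - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = b[k][j] if 0 <= k < depth else 0
        # Every PE performs one multiply-accumulate per cycle.
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


if __name__ == "__main__":
    c, n_cycles = systolic_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
    print(c, "in", n_cycles, "cycles")
```

In this model a product with inner dimension K on an R x C array takes K + R + C - 2 cycles, with all R x C multiply-accumulate units able to work in parallel once the pipeline fills, which is the property that makes the architecture attractive for large matrix operations.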