Taalas unveils Direct-to-Silicon ASIC for Llama 8B
Taalas, a team formed by former Tenstorrent engineers, announced a chip that embeds a model directly into silicon without external memory.
Design and performance
The company integrated the model's weights and architecture into the chip itself, avoiding HBM and complex packaging and simplifying inference hardware design.
Performance figures reported by Taalas include 17,000 tokens per second on Llama 3.1 8B, which the company says is an order of magnitude faster than current state-of-the-art GPUs.
- Production cost: the chip is claimed to be 20 times cheaper to produce than comparable GPU hardware.
- Power consumption: the device reportedly uses 10 times less energy than those GPUs for the same workload.
Trade-offs and flexibility
Taalas acknowledges technical compromises: the baked-in weights are quantized to 3–6 bit precision, and the demo's context is limited to 1,000 input tokens and 1,000 output tokens.
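Taalas has not published its quantization scheme, but the effect of 3–6 bit precision can be illustrated with a minimal sketch of uniform symmetric quantization (the function names and error metric below are illustrative, not Taalas's method):

```python
import numpy as np

def quantize_uniform(w, bits):
    """Uniform symmetric quantization to `bits` bits (illustrative only)."""
    levels = 2 ** (bits - 1) - 1              # e.g. 7 positive levels at 4-bit
    scale = np.max(np.abs(w)) / levels        # per-tensor scale factor
    q = np.round(w / scale).astype(np.int8)   # integer codes, as if baked on-chip
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float weight from its integer code."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

for bits in (3, 4, 6):
    q, s = quantize_uniform(w, bits)
    err = np.abs(dequantize(q, s) - w).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The sketch shows the trade-off in the 3–6 bit range: each extra bit roughly halves the reconstruction error, which is why the low end of that range is the more aggressive compromise.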
Although the ASIC targets a specific model family, the design retains support for LoRA adapters and a variable context window, preserving some fine-tuning flexibility.
Roadmap
The currently available silicon (HC1) implements Llama 8B. Taalas plans to release a mid-size chip with enhanced reasoning capabilities in spring and to demonstrate a frontier model on second-generation silicon by winter.
Practical notes
Taalas reports that the hardware already exists and has been demonstrated; the team presents the product as more than investor slides, while cautioning about its inherent architectural constraints.
The combination of high throughput, reduced cost and lower power consumption could reshape edge and on-premise inference deployment for compatible model families.