Project Roadmap: OCP MX Streaming MAC Unit
This document tracks the evolution from a standalone MAC unit to a fully integrated RISC-V vector accelerator.
1. Bit-Serial Evolution (Tiny-Serial)
The goal is to achieve an ultra-minimal footprint (< 500 gates) by processing data one bit at a time, inspired by the SERV core.
Step 4.1: [Datapath] Serial Aligner & Accumulator: Implement individual modules (fp8_aligner_serial, accumulator_serial).
Step 4.2: [Datapath] Serial LNS Multiplier: Implement fp8_mul_serial_lns.
Step 4.3: [Integration] Serial Input Buffering: Implement 8-bit shift registers to feed the serial datapath.
Step 4.4: [Integration] Multiplier Swap: Connect fp8_mul_serial_lns in the top-level serial path.
Step 4.5: [Integration] Aligner Swap: Connect fp8_aligner_serial and align timing.
Step 4.6: [Integration] Accumulator Swap: Replace the parallel accumulator in the serial path.
Step 4.7: [Integration] Serial-to-Parallel Handoff: Connect the serial accumulator’s parallel output to the top-level result capture.
Step 4.8: [Verification] Serial Parity: Verify functional parity between parallel and serial variants.
Phase C: Serial Integration (Advanced): Convert the format, rounding, and metadata registers to serial shift registers for area optimization.
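The bit-serial idea behind Steps 4.1 to 4.8 can be illustrated with a small software model. The sketch below is not the project's RTL: the function names (to_bits_lsb_first, serial_add, serial_accumulate) and the 8-bit default width are assumptions made for illustration. It shows a 1-bit full adder plus a single carry register consuming two LSB-first bit streams, one bit per cycle, which is the general technique a serial accumulator relies on.

```python
def to_bits_lsb_first(value: int, width: int) -> list[int]:
    """Serialize an integer into `width` bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb_first(bits: list[int]) -> int:
    """Reassemble an unsigned integer from an LSB-first bit stream."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_add(a_bits: list[int], b_bits: list[int]) -> list[int]:
    """Add two LSB-first bit streams, one bit per loop iteration ("cycle")."""
    carry = 0  # models the single carry flip-flop of a serial adder
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)            # sum bit produced this cycle
        carry = (a & b) | (carry & (a ^ b))  # carry held for the next cycle
    # The final carry is dropped, so the result wraps modulo 2**width.
    return out


def serial_accumulate(values: list[int], width: int = 8) -> int:
    """Accumulate a list of values through the serial adder (hypothetical helper)."""
    acc = [0] * width
    for v in values:
        acc = serial_add(acc, to_bits_lsb_first(v, width))
    return from_bits_lsb_first(acc)
```

A parity check in the spirit of Step 4.8 would compare serial_accumulate(values, width) against sum(values) % (1 << width) computed in parallel form.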
2. RISC-V & ISA Integration
Integration with the SERV bit-serial CPU and compliance with the ZvfofpXmin concept.
Step 5.1.1: [ISA] Format & Scale Instructions: Implement MX.SETFMT and MX.LOADS. (details)
Step 5.1.2: [ISA] MAC Instruction: Implement the packed MX.MAC instruction. (details)
Step 5.1.3: [ISA] Read Instruction: Implement MX.READ for accumulator retrieval. (details)
Step 5.2.1: [CSR] vmxfmt Definition: Implement the custom CSR bitfields and rounding control logic. (details)
Step 5.2.2: [Integration] SERV CSR Bridge: Integrate CSR access via SERV’s extension interface. (details)
VRF-to-Stream Bridge: Hardware shim to automate the 41-cycle OCP protocol from the Vector Register File. (details)
Tightly-Coupled Snooping: Optimize area by snooping SERV’s internal data streams directly. (details)
RVV 1.0 Compliance: Support vstart and vl for standard vector compliance. (details)
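As a rough behavioral model of what vstart and vl support implies, the sketch below applies an element-wise multiply-accumulate only to the body elements vstart <= i < vl, leaves elements before vstart untouched, and returns vstart reset to zero, as the RVV 1.0 specification describes for vector instructions. The function name vmacc_model, the list-based registers, and the tail-undisturbed handling of elements at or beyond vl are assumptions for illustration; the sketch does not describe MX.MAC itself.

```python
def vmacc_model(vd, vs1, vs2, vstart, vl):
    """Element-wise MAC over body elements only (hypothetical reference model).

    Returns the updated destination and the new vstart value.
    """
    if not 0 <= vl <= len(vd):
        raise ValueError("vl must fit within the register group")
    result = list(vd)  # prestart and tail elements are copied through unchanged
    for i in range(vstart, vl):  # empty when vstart >= vl, so nothing is updated
        result[i] = vd[i] + vs1[i] * vs2[i]
    new_vstart = 0  # vstart is reset to zero at the end of execution
    return result, new_vstart


# Resuming at element 1 with vl = 3 updates only elements 1 and 2.
updated, vstart_after = vmacc_model([1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3], 1, 3)
```

Modeling vstart explicitly makes it easier to test resumption after a trap, since a resumed instruction must skip elements that have already been written.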
3. Verification & Benchmarking
Phase D: Benchmarking: Perform gate-level power profiling and side-by-side area comparisons.
Physical Verification: Functional verification on FPGA (HIL) and silicon validation (Tiny Tapeout demo board).
LLM Serving Benchmarks: Benchmark the system using vLLM methodologies for real-world utility.
Completed Milestones
Architectural Refactoring & Infrastructure Prep
Step 8.1: [Refactor] Decouple Output Multiplexer
Step 8.2: [Refactor] Standardize Probing Interface
Step 8.3: [Refactor] Accumulator Port Expansion
Numerical Precision & FP32 Compliance
Step 9: [Infra] Parameterize Datapath Widths (40-bit upgrade)
Step 10: [Datapath] 16-bit Fractional Alignment
Step 11: [F2F] Leading Zero Count (LZC40) Module
Step 12-23: [F2F] Sign-Magnitude to Assembly stages
Step 24: [F2F] Fixed-to-Float Wrapper
Step 25-26: [Integration] Protocol Update & Output Mux
Step 27-28: [Verification] Cocotb Reference Model & Compliance Validation
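For readers unfamiliar with the leading zero count used in the fixed-to-float stages (Step 11), a minimal software reference might look like the sketch below. It is an illustration only, not the project's Cocotb reference model: the function names, the convention that a zero input returns 40, and the normalize helper are assumptions.

```python
WIDTH = 40  # matches the 40-bit datapath named in the roadmap


def lzc40(value: int) -> int:
    """Count leading zeros of an unsigned 40-bit value (assumed: zero input gives 40)."""
    if not 0 <= value < (1 << WIDTH):
        raise ValueError("value must be an unsigned 40-bit integer")
    return WIDTH - value.bit_length()


def normalize(value: int) -> tuple[int, int]:
    """Shift left until the leading one reaches bit 39; return (shifted, shift)."""
    shift = lzc40(value)
    if value == 0:
        return 0, shift  # nothing to normalize; shift is reported as 40
    return (value << shift) & ((1 << WIDTH) - 1), shift
```

A model like this can be compared bit-for-bit against simulated outputs in a testbench, with the shift amount feeding the exponent adjustment during float assembly.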
Last updated: March 2025