Project Roadmap: OCP MX Streaming MAC Unit

This document tracks the evolution from a standalone MAC unit to a fully integrated RISC-V vector accelerator.

1. Bit-Serial Evolution (Tiny-Serial)

The goal is to achieve an ultra-minimal footprint (< 500 gates) by processing data one bit at a time, inspired by the SERV core.

  • Step 4.1: [Datapath] Serial Aligner & Accumulator: Implement individual modules (fp8_aligner_serial, accumulator_serial).

  • Step 4.2: [Datapath] Serial LNS Multiplier: Implement fp8_mul_serial_lns.

  • Step 4.3: [Integration] Serial Input Buffering: Implement 8-bit shift registers to feed the serial datapath.

  • Step 4.4: [Integration] Multiplier Swap: Connect fp8_mul_serial_lns in the top-level serial path.

  • Step 4.5: [Integration] Aligner Swap: Connect fp8_aligner_serial and align timing.

  • Step 4.6: [Integration] Accumulator Swap: Replace the parallel accumulator in the serial path.

  • Step 4.7: [Integration] Serial-to-Parallel Handoff: Connect the serial accumulator’s parallel output to the top-level result capture.

  • Step 4.8: [Verification] Serial Parity: Verify functional parity between parallel and serial variants.

  • Phase C: Serial Integration (Advanced): Convert the format, rounding, and metadata registers to serial shift registers to reduce area.
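
As a rough illustration of the bit-serial idea behind Steps 4.1, 4.6, and 4.8, the Python sketch below models an adder that consumes one bit per cycle, LSB first, using a single full adder and one carry flip-flop. It then checks parity against an ordinary parallel addition. The names serial_add and serial_accumulate are hypothetical and do not describe the actual accumulator_serial RTL. The 40-bit default width is taken from the completed datapath upgrade, and LSB-first ordering is an assumption.

```python
def serial_add(a: int, b: int, width: int = 8) -> int:
    """Add two unsigned `width`-bit values one bit per cycle, LSB first."""
    carry = 0   # models the single carry flip-flop
    result = 0  # models the result shift register
    for cycle in range(width):
        bit_a = (a >> cycle) & 1
        bit_b = (b >> cycle) & 1
        # One full adder is reused on every cycle.
        sum_bit = bit_a ^ bit_b ^ carry
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))
        result |= sum_bit << cycle
    # The final carry is dropped, so the sum wraps modulo 2**width.
    return result


def serial_accumulate(values, width: int = 40) -> int:
    """Accumulate two's-complement values through the serial adder."""
    mask = (1 << width) - 1
    acc = 0
    for v in values:
        acc = serial_add(acc, v & mask, width)
    return acc


# Parity check in the spirit of Step 4.8: serial result vs. parallel add.
for a in range(0, 256, 17):
    for b in range(0, 256, 13):
        assert serial_add(a, b, 8) == (a + b) & 0xFF
```

The serial version trades latency for area: a 40-bit addition takes 40 cycles but needs only one full adder, which is the trade-off the sub-500-gate target relies on.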

2. RISC-V & ISA Integration

The goal is to integrate the MAC unit with the SERV bit-serial CPU and align it with the ZvfofpXmin concept.

  • Step 5.1.1: [ISA] Format & Scale Instructions: Implement MX.SETFMT and MX.LOADS. (details)

  • Step 5.1.2: [ISA] MAC Instruction: Implement the packed MX.MAC instruction. (details)

  • Step 5.1.3: [ISA] Read Instruction: Implement MX.READ for accumulator retrieval. (details)

  • Step 5.2.1: [CSR] vmxfmt Definition: Implement the custom CSR bitfields and rounding control logic. (details)

  • Step 5.2.2: [Integration] SERV CSR Bridge: Integrate CSR access via SERV’s extension interface. (details)

  • VRF-to-Stream Bridge: Hardware shim to automate the 41-cycle OCP protocol from the Vector Register File. (details)

  • Tightly-Coupled Snooping: Optimize area by snooping SERV’s internal data streams directly. (details)

  • RVV 1.0 Compliance: Support vstart and vl for standard vector compliance. (details)
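
To make the vstart and vl item concrete, the Python sketch below models the RVV 1.0 rule that an instruction begins at element index vstart, processes body elements up to vl, and resets vstart to zero on completion. How this project maps that rule onto the streaming MAC is not specified here, so the function mac_over_body and its use of plain integers in place of decoded MX elements are assumptions made purely for illustration.

```python
def mac_over_body(a, b, vl, vstart=0, acc=0):
    """Accumulate a[i] * b[i] for body elements vstart <= i < vl.

    Returns the new accumulator value and the value vstart holds
    after the operation completes (always 0).
    """
    if vl > min(len(a), len(b)):
        raise ValueError("vl exceeds the number of available elements")
    # Elements below vstart are skipped; the range is empty when
    # vstart >= vl, so the accumulator is left unchanged in that case.
    for i in range(vstart, vl):
        acc += a[i] * b[i]
    return acc, 0


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
full, _ = mac_over_body(a, b, vl=4)
# Resuming at element 2 after a partial run gives the same total.
partial, _ = mac_over_body(a, b, vl=2)
resumed, _ = mac_over_body(a, b, vl=4, vstart=2, acc=partial)
assert resumed == full
```

The resume case is the main reason vstart matters: an operation interrupted partway through can restart without repeating elements that were already accumulated.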

3. Verification & Benchmarking

  • Phase D: Benchmarking: Perform gate-level power profiling and side-by-side area comparisons.

  • Physical Verification: Functional verification on FPGA (hardware-in-the-loop, HIL) and silicon validation on the Tiny Tapeout demo board.

  • LLM Serving Benchmarks: Benchmark the system using vLLM methodologies to assess real-world utility.


Completed Milestones

Architectural Refactoring & Infrastructure Prep

  • Step 8.1: [Refactor] Decouple Output Multiplexer

  • Step 8.2: [Refactor] Standardize Probing Interface

  • Step 8.3: [Refactor] Accumulator Port Expansion

Numerical Precision & FP32 Compliance

  • Step 9: [Infra] Parameterize Datapath Widths (40-bit upgrade)

  • Step 10: [Datapath] 16-bit Fractional Alignment

  • Step 11: [F2F] Leading Zero Count (LZC40) Module

  • Steps 12-23: [F2F] Sign-Magnitude to Assembly stages

  • Step 24: [F2F] Fixed-to-Float Wrapper

  • Steps 25-26: [Integration] Protocol Update & Output Mux

  • Steps 27-28: [Verification] Cocotb Reference Model & Compliance Validation
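
The fixed-to-float (F2F) milestones above can be pictured with a small Python reference sketch covering sign-magnitude conversion, leading zero count, normalization, rounding, and assembly. It is not the project's RTL or its Cocotb model. It assumes a 40-bit two's-complement accumulator with 16 fractional bits, an IEEE 754 binary32 output, and round-to-nearest-even; the function names are hypothetical.

```python
WIDTH = 40      # datapath width (Step 9)
FRAC_BITS = 16  # assumed number of fractional bits (Step 10)


def lzc40(x: int) -> int:
    """Count leading zeros of a 40-bit unsigned value (40 when x == 0)."""
    assert 0 <= x < (1 << WIDTH)
    # Hardware would use a priority encoder; bit_length() models its result.
    return WIDTH - x.bit_length()


def fixed_to_fp32_bits(acc: int) -> int:
    """Convert a 40-bit two's-complement fixed-point value to FP32 bits."""
    acc &= (1 << WIDTH) - 1
    # Sign-magnitude conversion.
    sign = acc >> (WIDTH - 1)
    mag = ((1 << WIDTH) - acc) if sign else acc
    if mag == 0:
        return 0
    # Normalize so the leading one sits at bit 39.
    lz = lzc40(mag)
    norm = mag << lz
    exp = (WIDTH - 1 - lz) - FRAC_BITS + 127  # biased exponent
    # Keep 24 significant bits; round to nearest even on the 16 dropped bits.
    drop = WIDTH - 24
    mant = norm >> drop
    rem = norm & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if rem > half or (rem == half and (mant & 1)):
        mant += 1
        if mant >> 24:  # rounding overflowed the mantissa: renormalize
            mant >>= 1
            exp += 1
    # Assemble sign, exponent, and the 23 stored mantissa bits.
    return (sign << 31) | (exp << 23) | (mant & 0x7FFFFF)


assert fixed_to_fp32_bits(1 << FRAC_BITS) == 0x3F800000  # 1.0
```

Under these assumptions the exponent always stays in the normal range, since the largest magnitude is 2^23 and the smallest nonzero magnitude is 2^-16, so the sketch needs no overflow or subnormal handling.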


Last updated: March 2025