README
Tiny Tapeout IHP 26a - OCP MXFP8 Streaming MAC Unit
This project implements a Streaming Multiply-Accumulate (MAC) Unit compatible with the OCP Microscaling Formats (MX) Specification (v1.0). It is designed to fit within a 2x2 Tiny Tapeout tile using the IHP SG13G2 PDK.
Attributions
This project incorporates logic and concepts from several open-source resources:
fp8_mul by Clive Chan (Arithmetic logic).
Tiny Tapeout Verilog Template (Project structure).
OCP Microscaling Formats (MX) Specification v1.0 (Numerical and Protocol Specification).
We gratefully acknowledge these contributions to the open-source hardware and AI communities.
System Context
Source: docs/diagrams/CONTEXT_DIAGRAM.PUML
Functional Block Overview
| Block | Component | Detailed Function & Mathematics |
|---|---|---|
| FSM & Control | Cycle Counter, State Machine, Config Regs | Orchestrates the 41-cycle protocol. Captures scales and metadata (Rounding, Overflow, LNS, MX+). Supports Short Protocol (Cycle 0) to bypass scale loading for weight-stationary kernels. |
| Dual-Lane Multiplier | Decoders, Significand Mul, Exponent Path | Decodes elements and calculates products. Supports Mitchell’s LNS Approximation: |
| Dual Aligner Stage | Barrel Shifters, Rounding & Saturation | Aligns products to a common 40-bit fixed-point grid. Applies Shared Scaling (\(2^{X_A - 127}\)) and MX++ Exponent Offsets. Supports RNE, TRN, CEL, and FLR rounding modes. |
| Accumulator | Signed Adder, 32-bit Accumulation Reg | Performs 32-element summation. In Packed Mode, two 4-bit elements (FP4) are processed per cycle across dual lanes to double throughput. |
| Exception & Robustness | Sticky Registers, Output Override | Latches |
| Output Serializer | Byte Multiplexer | Extracts 8-bit chunks from the 32-bit accumulator for Big-Endian transmission over |
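The four rounding modes named for the Dual Aligner Stage can be illustrated in software. The following is a minimal Python sketch, not the RTL: it assumes a signed integer is right-shifted onto a coarser grid, and the function name `round_shift` and the mode strings are hypothetical.

```python
def round_shift(value: int, shift: int, mode: str) -> int:
    """Divide a signed integer by 2**shift, rounding according to `mode`."""
    if shift <= 0:
        return value << -shift              # left shift needs no rounding
    floor_q = value >> shift                # Python's >> rounds toward -infinity
    rem = value - (floor_q << shift)        # discarded bits, in [0, 2**shift)
    if rem == 0 or mode == "FLR":
        return floor_q                      # exact, or floor requested
    if mode == "CEL":
        return floor_q + 1                  # toward +infinity
    if mode == "TRN":
        # toward zero: negative values move up from the floor
        return floor_q + 1 if value < 0 else floor_q
    if mode == "RNE":
        half = 1 << (shift - 1)
        # round up above the midpoint, or at the midpoint when the floor is odd
        if rem > half or (rem == half and (floor_q & 1)):
            return floor_q + 1
        return floor_q
    raise ValueError("unknown rounding mode: " + mode)


if __name__ == "__main__":
    for m in ("TRN", "CEL", "FLR", "RNE"):
        print(m, round_shift(5, 1, m), round_shift(-5, 1, m))
```

The modes differ only when discarded bits are non-zero, which is why the exact case returns early.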
Internal Datapath
Source: docs/diagrams/DATAPATH_DIAGRAM.PUML
Protocol Description (MCU to TT/FPGA)
The MAC unit follows a 41-cycle streaming protocol (Cycles 0–40) to process a block of 32 elements.
Protocol State Machine
Operational Sequence
| Cycle | Input (ui_in) | Input (uio_in) | Output (uo_out) | Description |
|---|---|---|---|---|
| 0 | Metadata 0 | Metadata 1 | 0x00 / Probe Data | IDLE: Load MX+ / Debug or Start Fast Protocol. |
| 1 | Scale A | Format A / BM A | 0x00 / Probe Data | Load Scale A, Format A, and BM Index A. |
| 2 | Scale B | Format B / BM B | 0x00 / Probe Data | Load Scale B, Format B, and BM Index B. |
| 3-34 | Element \(A_i\) | Element \(B_i\) | 0x00 / Probe Data | Stream 32 pairs of elements (Standard).* |
| 35 | - | - | 0x00 / Meta Echo | Pipeline flush. |
| 36 | - | - | 0x00 | Final Shared Scaling calculation. |
| 37 | - | - | Result [31:24] | Output Byte 3 (MSB). |
| 38 | - | - | Result [23:16] | Output Byte 2. |
| 39 | - | - | Result [15:8] | Output Byte 1. |
| 40 | - | - | Result [7:0] | Output Byte 0 (LSB). |
*Note: For 4-bit formats (MXFP4), the unit supports Vector Packing (uio_in[6]=1 in Cycle 0). This reduces the STREAM phase to 16 cycles (Cycles 3-18) and the total sequence to 25 cycles.
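The cycle layout above can be written down as a small host-side model. This is a hedged Python sketch: the phase names and helper names are hypothetical, and the packed-mode flush, scaling, and output positions are inferred from the 25-cycle total and the cycle-19 metadata echo stated in this README rather than taken from the RTL.

```python
def total_cycles(packed: bool = False) -> int:
    """Sequence length: 3 setup + stream + 2 flush/scale + 4 output cycles."""
    return (16 if packed else 32) + 9


def phase(cycle: int, packed: bool = False) -> str:
    """Name the protocol phase for a cycle index (names are illustrative)."""
    stream_end = 2 + (16 if packed else 32)   # last STREAM cycle: 34, or 18 packed
    if cycle < 0 or cycle >= total_cycles(packed):
        raise ValueError("cycle outside the sequence")
    if cycle == 0:
        return "IDLE"
    if cycle == 1:
        return "LOAD_A"
    if cycle == 2:
        return "LOAD_B"
    if cycle <= stream_end:
        return "STREAM"
    if cycle == stream_end + 1:
        return "FLUSH"
    if cycle == stream_end + 2:
        return "SCALE"
    return "OUTPUT"


def assemble_result(out_bytes) -> int:
    """Join the four output bytes (MSB first) into a signed 32-bit integer."""
    return int.from_bytes(bytes(out_bytes), "big", signed=True)


if __name__ == "__main__":
    print([phase(c) for c in (0, 3, 34, 35, 36, 37, 40)])
    print(assemble_result([0x00, 0x00, 0x01, 0x00]))
```

`assemble_result` uses CPython's `int.from_bytes` with `signed=True`; on a MicroPython target the sign extension may need to be done by hand.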
Metadata Mapping
Cycle 0: IDLE / Initial Metadata
UI_IN
Source: docs/diagrams/METADATA_C0_UI_BITFIELD.json
Short Protocol (ui_in[7]=1): Immediately jumps to Cycle 3, reusing previous Scales.
Standard Start (ui_in[7]=0):
ui_in[2:0]: NBM Offset A (MX++)
Common Metadata (captured in both Standard and Short protocols):
ui_in[4:3]: LNS Mode (0: Normal, 1: LNS, 2: Hybrid)
ui_in[5]: Loopback Enable (Bypasses unit; uo_out = ui_in ^ uio_in)
ui_in[6]: Debug Enable (Enables probing and metadata echo)
UIO_IN
Source: docs/diagrams/METADATA_C0_UIO_BITFIELD.json
Short Protocol (ui_in[7]=1): uio_in[2:0] is captured as Format A & B.
Standard Start (ui_in[7]=0):
uio_in[2:0]: NBM Offset B (MX++)
Common Metadata (captured in both Standard and Short protocols):
uio_in[4:3]: Rounding Mode (0: TRN, 1: CEL, 2: FLR, 3: RNE)
uio_in[5]: Overflow Mode (0: SAT, 1: WRAP)
uio_in[6]: Packed Mode (1: Enable Vector Packing for FP4/MXFP4)
uio_in[7]: MX+ Enable (1: Enable MX+ extensions)
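The Cycle 0 bit fields can be packed on the host with a few shifts. The sketch below is illustrative only: the helper names and keyword arguments are hypothetical, and `low3` stands for whichever meaning bits [2:0] carry in the selected protocol (NBM offset in a Standard Start, Format A & B on uio_in in the Short Protocol).

```python
ROUNDING = {"TRN": 0, "CEL": 1, "FLR": 2, "RNE": 3}   # uio_in[4:3] codes


def pack_cycle0_ui(low3=0, lns_mode=0, loopback=False, debug=False, short=False):
    """Build the Cycle 0 ui_in byte from its documented fields."""
    if not 0 <= low3 <= 7:
        raise ValueError("low3 must fit in 3 bits")
    if not 0 <= lns_mode <= 3:
        raise ValueError("lns_mode must fit in 2 bits")
    return (low3
            | (lns_mode << 3)          # ui_in[4:3]: LNS Mode
            | (int(loopback) << 5)     # ui_in[5]: Loopback Enable
            | (int(debug) << 6)        # ui_in[6]: Debug Enable
            | (int(short) << 7))       # ui_in[7]: Short Protocol select


def pack_cycle0_uio(low3=0, rounding="TRN", wrap=False, packed=False, mx_plus=False):
    """Build the Cycle 0 uio_in byte from its documented fields."""
    if not 0 <= low3 <= 7:
        raise ValueError("low3 must fit in 3 bits")
    return (low3
            | (ROUNDING[rounding] << 3)  # uio_in[4:3]: Rounding Mode
            | (int(wrap) << 5)           # uio_in[5]: 0 = SAT, 1 = WRAP
            | (int(packed) << 6)         # uio_in[6]: Packed Mode
            | (int(mx_plus) << 7))       # uio_in[7]: MX+ Enable


if __name__ == "__main__":
    print(hex(pack_cycle0_ui(debug=True)), hex(pack_cycle0_uio(rounding="RNE", packed=True)))
```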
Cycle 1: Configuration Byte (uio_in)
Source: docs/diagrams/OCP_MX_CONFIG_BITFIELD.json
ui_in[7:0]: Scale A
uio_in[2:0]: Format A (0: E4M3, 1: E5M2, 2: E3M2, 3: E2M3, 4: E2M1, 5: INT8, 6: INT8_SYM)
uio_in[7:3]: BM Index A (MX+)
Cycle 2: Scale B / MX+ Metadata
Source: docs/diagrams/METADATA_C2_UIO_BITFIELD.json
ui_in[7:0]: Scale B
uio_in[2:0]: Format B (Enabled if SUPPORT_MIXED_PRECISION=1)
uio_in[7:3]: BM Index B (MX+)
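The configuration byte sent on uio_in in Cycles 1 and 2 combines a 3-bit format code with a 5-bit BM index. A minimal sketch follows; the helper names are hypothetical, and it assumes Format B uses the same codes listed for Format A, which this README does not state explicitly.

```python
FORMAT_CODES = {
    "E4M3": 0, "E5M2": 1, "E3M2": 2, "E2M3": 3,
    "E2M1": 4, "INT8": 5, "INT8_SYM": 6,
}
FORMAT_NAMES = {code: name for name, code in FORMAT_CODES.items()}


def pack_config(fmt: str, bm_index: int = 0) -> int:
    """Return the uio_in byte: format in bits [2:0], BM index in bits [7:3]."""
    if not 0 <= bm_index <= 31:
        raise ValueError("bm_index must fit in 5 bits")
    return FORMAT_CODES[fmt] | (bm_index << 3)


def unpack_config(byte: int):
    """Split a configuration byte back into (format name, BM index)."""
    return FORMAT_NAMES[byte & 0x7], (byte >> 3) & 0x1F


if __name__ == "__main__":
    cfg = pack_config("E2M1", bm_index=31)
    print(hex(cfg), unpack_config(cfg))
```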
Debugging Output
When enabled via ui_in[6] in Cycle 0, the uo_out[7:0] port provides real-time observability into the unit’s internal state during the phases that are normally silent (Cycles 0-35).
Enable: Set ui_in[6] = 1 during Cycle 0.
Probe Selection: Set uio_in[3:0] during Cycle 0 to select the internal signal to monitor.
Cycles 0-34 (Standard) or 0-18 (Packed): uo_out outputs the selected Probe Data (e.g., Accumulator MSB, Multiplier outputs, FSM state).
Cycle 35 (Standard) or 19 (Packed): uo_out outputs a Metadata Echo, confirming the captured configuration.
For a full list of available probes and the metadata echo bit-mapping, see DEBUG_TT.md.
MicroPython Example (TT DevKit)
You can run a single MAC operation on the Tiny Tapeout DevKit using the onboard RP2040 or RP2350 with MicroPython. The following script performs a 32-element dot product of \(1.0 \times 1.0\) with no scaling.
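As a rough illustration of what such a run sends, the sketch below builds the per-cycle (ui_in, uio_in) byte pairs for that dot product. It is not the project's script: `build_stimulus` is a hypothetical helper, it assumes E4M3 elements with both scales set to 127 (a factor of \(2^0\)), all Cycle 0 options left at zero, and zeros driven on the unused inputs in Cycles 35-40. Pin access is left out because it depends on the DevKit firmware.

```python
E4M3_ONE = 0x38      # sign 0, exponent 0b0111 (bias 7), mantissa 0b000 -> 1.0
SCALE_UNITY = 127    # UE8M0 with bias 127 -> 2**0, i.e. no scaling
FORMAT_E4M3 = 0      # format code 0, BM index 0


def build_stimulus():
    """Return 41 (ui_in, uio_in) pairs, one per protocol cycle."""
    frames = [(0x00, 0x00)]                      # Cycle 0: standard start, defaults
    frames.append((SCALE_UNITY, FORMAT_E4M3))    # Cycle 1: Scale A, Format A
    frames.append((SCALE_UNITY, FORMAT_E4M3))    # Cycle 2: Scale B, Format B
    frames += [(E4M3_ONE, E4M3_ONE)] * 32        # Cycles 3-34: element pairs
    frames += [(0x00, 0x00)] * 6                 # Cycles 35-40: inputs unused (assumed 0)
    return frames


if __name__ == "__main__":
    for cycle, (ui, uio) in enumerate(build_stimulus()):
        print(cycle, hex(ui), hex(uio))
```

On hardware, each pair would be applied before a clock edge, and the uo_out values read during the last four cycles form the 32-bit result, most significant byte first.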
Tiny Tapeout DevKit Pin Mapping
| Signal | RP2040 (v2.0/v3.1) | RP2350 (v3.2) |
|---|---|---|
| | GPIO 0-7 | GPIO 17-24 |
| | GPIO 8-15 | GPIO 33-40 |
| | GPIO 16-23 | GPIO 25-32 |
| | GPIO 24 | GPIO 16 |
| | GPIO 25 | GPIO 14 |
| | GPIO 26 | GPIO 15 |
For the full script and advanced usage, see test/TT_MAC_RUN.PY.
OCP MX Feature Support
This implementation follows the OCP Microscaling Formats (MX) Specification (v1.0).
Implemented Features
Multiple Element Formats:
MXFP8: E4M3 (Bias 7) and E5M2 (Bias 15).
MXFP6: E3M2 (Bias 3) and E2M3 (Bias 1).
MXFP4: E2M1 (Bias 1).
MXINT8: Standard and Symmetric 8-bit signed integers.
Shared Scaling: Hardware-accelerated application of shared scales (\(X_A, X_B\)) using the UE8M0 format (8-bit unsigned biased exponent, Bias 127).
Rounding Modes: Support for all four OCP MX rounding modes:
TRN: Truncate (Towards Zero).
CEL: Ceil (Towards \(+\infty\)).
FLR: Floor (Towards \(-\infty\)).
RNE: Round-to-Nearest-Ties-to-Even.
Overflow Methods: Configurable behavior for out-of-range results:
SAT: Saturation (Clamp to Max/Min representable value).
WRAP: Wrapping (Modulo arithmetic).
Mixed-Precision Operations: Independent format selection for Operand A and Operand B within a single MAC block.
OCP MX+ (Extended Mantissa): Higher precision for “Block Max” (BM) elements by repurposing exponent bits as an extended mantissa: \(V(A_{BM}) = S \cdot 2^{E_{max} - \text{Bias}} \cdot \left(1 + \frac{\text{concat}(E_i, M_i)}{2^{E_{bits} + M_{bits}}}\right) \cdot 2^{X_A - 127}\)
Efficiency: 41-cycle pipelined streaming protocol with Fast Start (Scale Compression) to reuse scales/formats across consecutive blocks.
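The element formats, shared scale, and MX+ formula listed above can be checked numerically with a small reference model. This is a hedged sketch rather than a model of the RTL: function names are hypothetical, special values (NaN, Infinity) are ignored, INT8 formats are omitted, subnormals follow the usual \(2^{1-\text{Bias}} \cdot M / 2^{M_{bits}}\) rule, and \(E_{max}\) is passed in by the caller because its exact value is not given here.

```python
FORMATS = {  # name: (exponent bits, mantissa bits, bias)
    "E4M3": (4, 3, 7), "E5M2": (5, 2, 15),
    "E3M2": (3, 2, 3), "E2M3": (2, 3, 1), "E2M1": (2, 1, 1),
}


def _sign(bits: int, ebits: int, mbits: int) -> float:
    """Extract the sign bit, which sits just above exponent and mantissa."""
    return -1.0 if (bits >> (ebits + mbits)) & 1 else 1.0


def decode_element(bits: int, fmt: str) -> float:
    """Decode one floating-point element, without the shared scale."""
    ebits, mbits, bias = FORMATS[fmt]
    exp = (bits >> mbits) & ((1 << ebits) - 1)
    man = bits & ((1 << mbits) - 1)
    if exp == 0:  # subnormal: no implicit leading one
        return _sign(bits, ebits, mbits) * 2.0 ** (1 - bias) * man / (1 << mbits)
    return _sign(bits, ebits, mbits) * 2.0 ** (exp - bias) * (1 + man / (1 << mbits))


def apply_shared_scale(value: float, x: int) -> float:
    """Apply a UE8M0 shared scale: multiply by 2**(x - 127)."""
    return value * 2.0 ** (x - 127)


def decode_bm_element(bits: int, fmt: str, x: int, e_max: int) -> float:
    """Evaluate the MX+ Block Max formula; exponent bits extend the mantissa."""
    ebits, mbits, bias = FORMATS[fmt]
    frac = bits & ((1 << (ebits + mbits)) - 1)          # concat(E_i, M_i)
    mant = 1 + frac / (1 << (ebits + mbits))
    return apply_shared_scale(_sign(bits, ebits, mbits) * 2.0 ** (e_max - bias) * mant, x)


if __name__ == "__main__":
    print(decode_element(0x38, "E4M3"), decode_element(0x7, "E2M1"))
    print(apply_shared_scale(1.0, 130))
```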
Omitted Features & Deviations
Subnormal Support: The RTL fully supports subnormal elements (denormals) for all floating-point formats, providing high numerical accuracy for small values.
Fixed Block Size: The unit is hard-coded for a block size of \(k=32\) elements.
NaN/Infinity Handling:
E5M2 fully supports IEEE-754 style Infinities and NaNs.
For other formats, the unit prioritizes saturation for out-of-range values, consistent with OCP MX “Saturation-only” modes for narrower formats.
Accumulator Precision: A 32-bit signed fixed-point accumulator is used, providing sufficient range for 32-element dot products of all supported formats.
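The difference between the SAT and WRAP overflow methods on a 32-bit signed accumulator can be shown with a short model. This sketch assumes the overflow mode is applied at the accumulator width after each addition; where the RTL applies it is not specified here, and the function name is hypothetical.

```python
ACC_BITS = 32
ACC_MAX = (1 << (ACC_BITS - 1)) - 1     # 0x7FFFFFFF
ACC_MIN = -(1 << (ACC_BITS - 1))        # -0x80000000


def accumulate(acc: int, addend: int, wrap: bool = False) -> int:
    """Add `addend` to `acc` and bring the sum back into 32-bit signed range."""
    total = acc + addend
    if wrap:
        total &= (1 << ACC_BITS) - 1                 # keep the low 32 bits
        if total > ACC_MAX:
            total -= 1 << ACC_BITS                   # reinterpret as two's complement
        return total
    return max(ACC_MIN, min(ACC_MAX, total))         # SAT: clamp to the limits


if __name__ == "__main__":
    print(accumulate(ACC_MAX, 1), accumulate(ACC_MAX, 1, wrap=True))
```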
FPGA Support
This project includes support for generating an FPGA bitstream for the Sipeed Tang Nano 4K (Gowin GW1NSR-4C).
For detailed build, flash, and test instructions, see the Tang Nano 4K Deployment & Testing Guide.
The bitstream is automatically generated by the GitHub Action defined in .github/workflows/gowin.yaml.
Pin Mapping for Tang Nano 4K
| Signal | Tang Nano 4K Pin | Description |
|---|---|---|
| | 40-39, 35-30 | Scale A / Elements A |
| | 9-7, 22, 44-41 | Serialized Result |
| | 21-16, 13, 10 | Scale B / Elements B |
| | 45 | Onboard 27MHz Clock (Target: 20MHz for timing closure) |
| | 15 | Button S1 (Reset) |
| | 14 | Button S2 (Enable) |
Note: Pins are listed in MSB-to-LSB order where applicable.
Resources
Glossary
A comprehensive list of terms and acronyms used in this project can be found in the Project Glossary.
Compilation Options
The MAC unit is highly configurable through Verilog parameters. These can be adjusted to balance feature support against hardware area (gate count).
Hardware Parameters
| Parameter | Default | Description |
|---|---|---|
| | 32 | Bit-width of the internal alignment datapath. |
| | 24 | Bit-width of the fixed-point accumulator. |
| | 1 | Enable support for E4M3 (MXFP8) format. |
| | 0 | Enable support for E5M2 (MXFP8) format. |
| | 0 | Enable support for E3M2 and E2M3 (MXFP6) formats. |
| | 1 | Enable support for E2M1 (MXFP4) format. |
| | 0 | Enable support for INT8 and INT8_SYM formats. |
| | 0 | Enable multiplier pipelining for higher clock frequencies. |
| | 0 | Enable advanced rounding modes (RNE, CEL, FLR). |
| | 0 | Allow different formats for Operand A and B. |
| | 0 | Enable 2x throughput for FP4 using vector packing. |
| | 0 | Enable bit-serial throughput for packed FP4 formats. |
| | 0 | Enable input buffering for FP4 formats. |
| | 0 | Enable MX+ extensions (Repurposed Exponents). |
| | 1 | Enable bit-serial multiplier core (reduces area). |
| | 8 | Bit-serial period (typically 8 for FP8). |
| | 0 | Enable OCP MX Shared Scaling logic. |
| | 0 | Use Logarithmic Number System (LNS) multiplier core. |
| | 0 | Use precise LUT-based LNS (higher area). |
Pre-defined Variants
The project includes a configuration script (scripts/configure_variant.py) to quickly switch between common profiles:
Baseline: Full feature set enabled, 40-bit aligner, 32-bit accumulator, parallel multipliers.
Light/Lite: Balanced configuration with MXFP6, Vector Packing, and MX+ disabled.
Tiny: Minimal footprint with only essential FP8 support enabled.