README

Tiny Tapeout IHP 26a - OCP MXFP8 Streaming MAC Unit

This project implements a Streaming Multiply-Accumulate (MAC) Unit compatible with the OCP Microscaling Formats (MX) Specification (v1.0). It is designed to fit within a 2x2 Tiny Tapeout tile using the IHP SG13G2 PDK.

Attributions

This project incorporates logic and concepts from several open-source resources:

fp8_mul by Clive Chan (Arithmetic logic).
Tiny Tapeout Verilog Template (Project structure).
OCP Microscaling Formats (MX) Specification v1.0 (Numerical and Protocol Specification).

We gratefully acknowledge these contributions to the open-source hardware and AI communities.

System Context

System Context Diagram

Source: docs/diagrams/CONTEXT_DIAGRAM.PUML

Functional Block Overview

Block	Component	Detailed Function & Mathematics
FSM & Control	Cycle Counter, State Machine, Config Regs	Orchestrates the 41-cycle protocol. Captures scales and metadata (Rounding, Overflow, LNS, MX+). Supports Short Protocol (Cycle 0) to bypass scale loading for weight-stationary kernels.
Dual-Lane Multiplier	Decoders, Significand Mul, Exponent Path	Decodes elements and calculates products. Supports Mitchell’s LNS Approximation: $(1+m_a)(1+m_b) \approx \begin{cases} 1 + m_a + m_b & m_a+m_b \lt 1 \\\\ 2(m_a + m_b) & m_a+m_b \ge 1 \end{cases}$ Handles MX+ Extended Mantissa: $V(A_{BM}) = S \cdot 2^{E_{max} - \text{Bias}} \cdot \left(1 + \frac{\text{concat}(E_i, M_i)}{2^{E_{bits} + M_{bits}}}\right) \cdot 2^{X_A - 127}$
Dual Aligner Stage	Barrel Shifters, Rounding & Saturation	Aligns products to a common 40-bit fixed-point grid. Applies Shared Scaling ($2^{X_A - 127}$) and MX++ Exponent Offsets. Supports RNE, TRN, CEL, and FLR rounding modes.
Accumulator	Signed Adder, 32-bit Accumulation Reg	Performs 32-element summation. In Packed Mode, two 4-bit elements (FP4) are processed per cycle across dual lanes to double throughput.
Exception & Robustness	Sticky Registers, Output Override	Latches `nan_sticky` and `inf_sticky` flags. Overrides the final result with OCP special patterns if an exception occurs during the streaming block.
Output Serializer	Byte Multiplexer	Extracts 8-bit chunks from the 32-bit accumulator for Big-Endian transmission over `uo_out` during Cycles 37-40.

Internal Datapath

Internal Datapath Diagram

Source: docs/diagrams/DATAPATH_DIAGRAM.PUML

Read the documentation for project
Die Size & Area Analysis
Flip-Flop Usage Analysis
LUT and Gate Usage Analysis
Consolidated Project Roadmap
Project Concept & Detailed Roadmap
MX+ Implementation Roadmap
VMXDOTP Paper Summary
VMXDOTP SERV Integration Roadmap
RISC-V CSR Mapping Concept
Tang Nano 4K Deployment & Testing Guide
Cortex-M3 Testbench Roadmap
Cortex-M3 Testbench Guide
Local Setup Guide (WSL2)

Protocol Description (MCU to TT/FPGA)

The MAC unit follows a 41-cycle streaming protocol (Cycles 0–40) to process a block of 32 elements.

Protocol State Machine

Protocol State Machine Diagram

Source: docs/diagrams/PROTOCOL_STATES.PUML

Operational Sequence

Cycle	Input `ui_in[7:0]`	Input `uio_in[7:0]`	Output `uo_out[7:0]`	Description
0	Metadata 0	Metadata 1	0x00 / Probe Data	IDLE: Load MX+ / Debug or Start Fast Protocol.
1	Scale A	Format A / BM A	0x00 / Probe Data	Load Scale A, Format A, and BM Index A.
2	Scale B	Format B / BM B	0x00 / Probe Data	Load Scale B, Format B, and BM Index B.
3-34	Element $A_i$	Element $B_i$	0x00 / Probe Data	Stream 32 pairs of elements (Standard).*
35	-	-	0x00 / Meta Echo	Pipeline flush.
36	-	-	0x00	Final Shared Scaling calculation.
37	-	-	Result [31:24]	Output Byte 3 (MSB).
38	-	-	Result [23:16]	Output Byte 2.
39	-	-	Result [15:8]	Output Byte 1.
40	-	-	Result [7:0]	Output Byte 0 (LSB).

*Note: For 4-bit formats (MXFP4), the unit supports Vector Packing (uio_in[6]=1 in Cycle 0). This reduces the STREAM phase to 16 cycles (Cycles 3-18) and the total sequence to 25 cycles.

Metadata Mapping

Cycle 0: IDLE / Initial Metadata

UI_IN

Metadata 0 (ui_in) Diagram Source: docs/diagrams/METADATA_C0_UI_BITFIELD.json

Short Protocol (ui_in[7]=1):
- Immediately jumps to Cycle 3, reusing previous Scales.
Standard Start (ui_in[7]=0):
- ui_in[2:0]: NBM Offset A (MX++)
Common Metadata (captured in both Standard and Short protocols):
- ui_in[4:3]: LNS Mode (0: Normal, 1: LNS, 2: Hybrid)
- ui_in[5]: Loopback Enable (Bypasses unit; uo_out = ui_in ^ uio_in)
- ui_in[6]: Debug Enable (Enables probing and metadata echo)

UIO_IN

Metadata 1 (uio_in) Diagram Source: docs/diagrams/METADATA_C0_UIO_BITFIELD.json

Short Protocol (ui_in[7]=1):
- uio_in[2:0] is captured as Format A & B.
Standard Start (ui_in[7]=0):
- uio_in[2:0]: NBM Offset B (MX++)
Common Metadata (captured in both Standard and Short protocols):
- uio_in[4:3]: Rounding Mode (0: TRN, 1: CEL, 2: FLR, 3: RNE)
- uio_in[5]: Overflow Mode (0: SAT, 1: WRAP)
- uio_in[6]: Packed Mode (1: Enable Vector Packing for FP4/MXFP4)
- uio_in[7]: MX+ Enable (1: Enable MX+ extensions)

Cycle 1: Configuration Byte (`uio_in`)

Configuration Byte Diagram Source: docs/diagrams/OCP_MX_CONFIG_BITFIELD.json

ui_in[7:0]: Scale A
uio_in[2:0]: Format A (0: E4M3, 1: E5M2, 2: E3M2, 3: E2M3, 4: E2M1, 5: INT8, 6: INT8_SYM)
uio_in[7:3]: BM Index A (MX+)

Cycle 2: Scale B / MX+ Metadata

Metadata 2 (uio_in) Diagram Source: docs/diagrams/METADATA_C2_UIO_BITFIELD.json

ui_in[7:0]: Scale B
uio_in[2:0]: Format B (Enabled if SUPPORT_MIXED_PRECISION=1)
uio_in[7:3]: BM Index B (MX+)

Debugging Output

When enabled via ui_in[6] in Cycle 0, the uo_out[7:0] port provides real-time observability into the unit’s internal state during the phases that are normally silent (Cycles 0-35).

Enable: Set ui_in[6] = 1 during Cycle 0.
Probe Selection: Set uio_in[3:0] during Cycle 0 to select the internal signal to monitor.
Cycles 0-34 (Standard) or 0-18 (Packed): uo_out outputs the selected Probe Data (e.g., Accumulator MSB, Multiplier outputs, FSM state).
Cycle 35 (Standard) or 19 (Packed): uo_out outputs a Metadata Echo, confirming the captured configuration.

For a full list of available probes and the metadata echo bit-mapping, see DEBUG_TT.md.

MicroPython Example (TT DevKit)

You can run a single MAC operation on the Tiny Tapeout DevKit using the onboard RP2040 or RP2350 with MicroPython. The following script performs a 32-element dot product of $1.0 \times 1.0$ with no scaling.

Tiny Tapeout DevKit Pin Mapping

Signal	RP2040 (v2.0/v3.1)	RP2350 (v3.2)
`ui_in[7:0]`	GPIO 0-7	GPIO 17-24
`uo_out[7:0]`	GPIO 8-15	GPIO 33-40
`uio[7:0]`	GPIO 16-23	GPIO 25-32
`clk`	GPIO 24	GPIO 16
`rst_n`	GPIO 25	GPIO 14
`ena`	GPIO 26	GPIO 15

For the full script and advanced usage, see test/TT_MAC_RUN.PY.

OCP MX Feature Support

This implementation follows the OCP Microscaling Formats (MX) Specification (v1.0).

Implemented Features

Multiple Element Formats:
- MXFP8: E4M3 (Bias 7) and E5M2 (Bias 15).
- MXFP6: E3M2 (Bias 3) and E2M3 (Bias 1).
- MXFP4: E2M1 (Bias 1).
- MXINT8: Standard and Symmetric 8-bit signed integers.
Shared Scaling: Hardware-accelerated application of shared scales ($X_A, X_B$) using the UE8M0 format (8-bit unsigned biased exponent, Bias 127).
Rounding Modes: Support for all four OCP MX rounding modes:
- TRN: Truncate (Towards Zero).
- CEL: Ceil (Towards $+\infty$).
- FLR: Floor (Towards $-\infty$).
- RNE: Round-to-Nearest-Ties-to-Even.
Overflow Methods: Configurable behavior for out-of-range results:
- SAT: Saturation (Clamp to Max/Min representable value).
- WRAP: Wrapping (Modulo arithmetic).
Mixed-Precision Operations: Independent format selection for Operand A and Operand B within a single MAC block.
OCP MX+ (Extended Mantissa): Higher precision for “Block Max” (BM) elements by repurposing exponent bits as an extended mantissa: $$V(A_{BM}) = S \cdot 2^{E_{max} - \text{Bias}} \cdot \left(1 + \frac{\text{concat}(E_i, M_i)}{2^{E_{bits} + M_{bits}}}\right) \cdot 2^{X_A - 127}$$
Efficiency: 41-cycle pipelined streaming protocol with Fast Start (Scale Compression) to reuse scales/formats across consecutive blocks.

Omitted Features & Deviations

Subnormal Support: The RTL fully supports subnormal elements (denormals) for all floating-point formats, providing high numerical accuracy for small values.
Fixed Block Size: The unit is hard-coded for a block size of $k=32$ elements.
NaN/Infinity Handling:
- E5M2 fully supports IEEE-754 style Infinities and NaNs.
- For other formats, the unit prioritizes saturation for out-of-range values, consistent with OCP MX “Saturation-only” modes for narrower formats.
Accumulator Precision: A 32-bit signed fixed-point accumulator is used, providing sufficient range for 32-element dot products of all supported formats.

FPGA Support

This project includes support for generating an FPGA bitstream for the Sipeed Tang Nano 4K (Gowin GW1NSR-4C).

For detailed build, flash, and test instructions, see the Tang Nano 4K Deployment & Testing Guide.

The bitstream is automatically generated by the GitHub Action defined in .github/workflows/gowin.yaml.

Pin Mapping for Tang Nano 4K

Signal	Tang Nano 4K Pin	Description
`ui_in[7:0]`	40-39, 35-30	Scale A / Elements A
`uo_out[7:0]`	9-7, 22, 44-41	Serialized Result
`uio[7:0]`	21-16, 13, 10	Scale B / Elements B
`clk`	45	Onboard 27MHz Clock (Target: 20MHz for timing closure)
`rst_n`	15	Button S1 (Reset)
`ena`	14	Button S2 (Enable)

Note: Pins are listed in MSB-to-LSB order where applicable.

Resources

Glossary

A comprehensive list of terms and acronyms used in this project can be found in the Project Glossary.

Compilation Options

The MAC unit is highly configurable through Verilog parameters. These can be adjusted to balance feature support against hardware area (gate count).

Hardware Parameters

Parameter	Default	Description
`ALIGNER_WIDTH`	32	Bit-width of the internal alignment datapath.
`ACCUMULATOR_WIDTH`	24	Bit-width of the fixed-point accumulator.
`SUPPORT_E4M3`	1	Enable support for E4M3 (MXFP8) format.
`SUPPORT_E5M2`	0	Enable support for E5M2 (MXFP8) format.
`SUPPORT_MXFP6`	0	Enable support for E3M2 and E2M3 (MXFP6) formats.
`SUPPORT_MXFP4`	1	Enable support for E2M1 (MXFP4) format.
`SUPPORT_INT8`	0	Enable support for INT8 and INT8_SYM formats.
`SUPPORT_PIPELINING`	0	Enable multiplier pipelining for higher clock frequencies.
`SUPPORT_ADV_ROUNDING`	0	Enable advanced rounding modes (RNE, CEL, FLR).
`SUPPORT_MIXED_PRECISION`	0	Allow different formats for Operand A and B.
`SUPPORT_VECTOR_PACKING`	0	Enable 2x throughput for FP4 using vector packing.
`SUPPORT_PACKED_SERIAL`	0	Enable bit-serial throughput for packed FP4 formats.
`SUPPORT_INPUT_BUFFERING`	0	Enable input buffering for FP4 formats.
`SUPPORT_MX_PLUS`	0	Enable MX+ extensions (Repurposed Exponents).
`SUPPORT_SERIAL`	1	Enable bit-serial multiplier core (reduces area).
`SERIAL_K_FACTOR`	8	Bit-serial period (typically 8 for FP8).
`ENABLE_SHARED_SCALING`	0	Enable OCP MX Shared Scaling logic.
`USE_LNS_MUL`	0	Use Logarithmic Number System (LNS) multiplier core.
`USE_LNS_MUL_PRECISE`	0	Use precise LUT-based LNS (higher area).

Pre-defined Variants

The project includes a configuration script (scripts/configure_variant.py) to quickly switch between common profiles:

Baseline: Full feature set enabled, 40-bit aligner, 32-bit accumulator, parallel multipliers.
Light/Lite: Balanced configuration with MXFP6, Vector Packing, and MX+ disabled.
Tiny: Minimal footprint with only essential FP8 support enabled.