OCP MXFP8 Streaming MAC Unit

High-Performance AI Inference Accelerator with Shared Scaling

1. General Description

The OCP MXFP8 Streaming MAC Unit is a high-performance, area-optimized arithmetic core designed for AI inference acceleration. Compliant with the Open Compute Project (OCP) Microscaling Formats (MX) Specification v1.0, the unit supports 8-bit, 6-bit, and 4-bit floating-point formats as well as 8-bit integers.

Featuring hardware-accelerated shared scaling and an area-efficient logarithmic multiplier path, the core targets both resource-constrained edge devices and large-scale AI accelerators. The “Full” edition occupies a 2x2 Tiny Tapeout tile and provides dual-lane processing.

2. Features

  • OCP MX Compliance: Full support for OCP Microscaling Formats v1.0.

  • Multi-Format Support:

    • FP8 (E4M3, E5M2)

    • FP6 (E3M2, E2M3)

    • FP4 (E2M1)

    • INT8 / INT8_SYM

  • High-Precision Datapath:

    • 40-bit internal aligner for high-dynamic range support.

    • 32-bit signed fixed-point accumulator.

  • Vector Packing: 2x throughput for 4-bit formats (FP4) via dual-lane streaming.

  • Hardware Scaling: Automatic UE8M0 shared exponent application (\(2^{E-127}\)).

  • Flexible Rounding: Supports RNE (Round-to-Nearest-Even), TRN (truncate), CEL (ceiling), and FLR (floor).

  • MX+ Extensions: Extended mantissa for “Block Max” outliers to improve accuracy.

  • LNS Mode: Integrated Mitchell’s Approximation for area-optimized multiplication.

  • Logic Analyzer Mode: 14 selectable internal probes for real-time silicon monitoring.

  • Mixed Precision: Independent format selection for Operand A and Operand B.

3. Applications

  • Mobile and Edge AI Inference.

  • Deep Learning Accelerator (DLA) sub-modules.

  • Convolutional Neural Network (CNN) hardware acceleration.

  • Quantized Large Language Model (LLM) execution.

  • Low-power DSP for IoT and wearables.

4. Functional Block Diagram

The unit follows a pipelined architecture consisting of a dual-lane multiplier, a high-dynamic range aligner, and a 32-bit accumulator.
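The final stage of this pipeline, the 32-bit signed accumulator, can be modelled in a few lines of C. The sketch below is a behavioural model only, not the RTL: the function name `acc_add` is hypothetical, and the saturate/wrap switch mirrors the OVFL_WRAP bit described in Section 6.3.

```c
#include <stdbool.h>
#include <stdint.h>

/* Behavioural model of one accumulate step (hypothetical helper).
 * wrap == false: clamp to [INT32_MIN, INT32_MAX] (OVFL_WRAP = 0, SAT).
 * wrap == true : keep the low 32 bits of the sum (OVFL_WRAP = 1, WRAP). */
static int32_t acc_add(int32_t acc, int32_t product, bool wrap)
{
    /* Widen first so the addition itself cannot overflow. */
    int64_t sum = (int64_t)acc + (int64_t)product;

    if (wrap) {
        /* Conversion back to int32_t is two's-complement on all
         * mainstream compilers (implementation-defined before C23). */
        return (int32_t)(uint32_t)sum;
    }
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```

Widening to 64 bits keeps the overflow check simple in software; hardware would typically detect the same condition from the carry and sign bits.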

[Figure: System Context Diagram]


5. Pin Configuration and Functions

The unit utilizes an 8-bit streaming interface to minimize pin count while maintaining high throughput.

| Pin | Name | Type | Description |
|---|---|---|---|
| ui_in[7:0] | DATA_A | I | Operand A elements, Scale A, or Metadata 0. |
| uio_in[7:0] | DATA_B | I | Operand B elements, Scale B, or Metadata 1. |
| uo_out[7:0] | RESULT | O | Serialized 32-bit result or Debug Probe data. |
| uio_out[7:0] | RESERVED | O | Driven to 0x00 (uio pins are configured as inputs via uio_oe). |
| clk | CLK | I | System clock (target: 20 MHz). |
| rst_n | RESET_N | I | Active-low asynchronous reset. |
| ena | ENA | I | Clock enable. |


6. Detailed Description

6.1 Operational Modes

The MAC unit supports several specialized modes to balance throughput, area, and precision.

  • Standard Mode: Single-lane processing for 8-bit, 6-bit, and 4-bit formats over 41 cycles.

  • Packed Mode: Dual-lane processing for 4-bit formats (FP4), doubling throughput by reducing the stream phase to 16 cycles.

  • LNS Mode: Replaces the standard multiplier with a logarithmic adder using Mitchell’s Approximation, reducing area by ~50% in the multiplier core.

  • Debug Mode: Allows internal probing of the accumulator, FSM states, and multiplier results via the uo_out port.

6.2 Streaming Protocol

The unit operates using a 41-cycle streaming protocol to process a block of 32 elements (\(k=32\)).

| Cycle | Input ui_in | Input uio_in | Output uo_out | Phase |
|---|---|---|---|---|
| 0 | Metadata 0 | Metadata 1 | 0x00 / Probe | IDLE / CONFIG |
| 1 | Scale A | Format A / BM A | 0x00 / Probe | LOAD_CFG_A |
| 2 | Scale B | Format B / BM B | 0x00 / Probe | LOAD_CFG_B |
| 3-34 | Element \(A_i\) | Element \(B_i\) | 0x00 / Probe | STREAM |
| 35 | - | - | Meta Echo | FLUSH |
| 36 | - | - | 0x00 | CALC |
| 37-40 | - | - | Result [31:0] | OUTPUT |

Note: In Packed Mode, the STREAM phase is reduced to 16 cycles (Cycles 3-18).
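A host-side model of this schedule can help when writing a testbench or driver. The following is a minimal sketch, assuming the FLUSH, CALC, and OUTPUT phases follow the STREAM phase immediately in Packed Mode (the table only gives explicit cycle numbers for the standard protocol). The enum and function names are hypothetical, and the Short Protocol is not modelled.

```c
#include <stdbool.h>

/* Protocol phases, named after the Phase column of the table above. */
typedef enum {
    PH_CONFIG, PH_LOAD_CFG_A, PH_LOAD_CFG_B,
    PH_STREAM, PH_FLUSH, PH_CALC, PH_OUTPUT, PH_DONE
} mac_phase_t;

/* Map a cycle index within one block to its protocol phase. */
static mac_phase_t mac_phase(unsigned cycle, bool packed)
{
    const unsigned stream_start = 3u;
    /* First cycle after STREAM: 35 in standard mode, 19 in Packed Mode. */
    const unsigned stream_end = stream_start + (packed ? 16u : 32u);

    if (cycle == 0u) return PH_CONFIG;
    if (cycle == 1u) return PH_LOAD_CFG_A;
    if (cycle == 2u) return PH_LOAD_CFG_B;
    if (cycle < stream_end) return PH_STREAM;
    if (cycle == stream_end) return PH_FLUSH;
    if (cycle == stream_end + 1u) return PH_CALC;
    if (cycle < stream_end + 6u) return PH_OUTPUT; /* four result bytes */
    return PH_DONE;
}
```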

6.3 Register Layouts (Cycles 0-2)

Cycle 0: Metadata 0 (ui_in)

| Bit | Name | Description |
|---|---|---|
| [7] | SHORT_PROT | 1: Reuse previous scales/formats; jump to Cycle 3. |
| [6] | DEBUG_EN | 1: Enable internal probing and metadata echo. |
| [5] | LOOPBACK_EN | 1: Enable XOR loopback (uo_out = ui_in ^ uio_in). |
| [4:3] | LNS_MODE | Multiplier mode: 0: Normal, 1: LNS, 2: Hybrid. |
| [2:0] | NBM_OFF_A | Exponent offset for Operand A (MX++). |

Cycle 0: Metadata 1 (uio_in)

| Bit | Name | Description |
|---|---|---|
| [7] | MX_PLUS_EN | 1: Enable OCP MX+ extended mantissa. |
| [6] | PACKED_EN | 1: Enable Vector Packing (2 elements/byte). |
| [5] | OVFL_WRAP | 0: SAT (Saturate), 1: WRAP. |
| [4:3] | ROUND_MODE | 0: TRN, 1: CEL, 2: FLR, 3: RNE. |
| [2:0] | NBM_OFF_B | Exponent offset for Operand B / Format select. |
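The two Cycle 0 bytes can be assembled from their fields with plain shifts and masks. This is a sketch that follows the bit positions listed above; the helper names `pack_meta0` and `pack_meta1` are hypothetical and not part of any shipped driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build Metadata 0 (ui_in at Cycle 0). Field widths follow the table above. */
static uint8_t pack_meta0(bool short_prot, bool debug_en, bool loopback_en,
                          uint8_t lns_mode, uint8_t nbm_off_a)
{
    return (uint8_t)(((short_prot ? 1u : 0u) << 7) |
                     ((debug_en ? 1u : 0u) << 6) |
                     ((loopback_en ? 1u : 0u) << 5) |
                     ((lns_mode & 0x3u) << 3) |   /* [4:3] LNS_MODE  */
                     (nbm_off_a & 0x7u));         /* [2:0] NBM_OFF_A */
}

/* Build Metadata 1 (uio_in at Cycle 0). */
static uint8_t pack_meta1(bool mx_plus_en, bool packed_en, bool ovfl_wrap,
                          uint8_t round_mode, uint8_t nbm_off_b)
{
    return (uint8_t)(((mx_plus_en ? 1u : 0u) << 7) |
                     ((packed_en ? 1u : 0u) << 6) |
                     ((ovfl_wrap ? 1u : 0u) << 5) |
                     ((round_mode & 0x3u) << 3) | /* [4:3] ROUND_MODE */
                     (nbm_off_b & 0x7u));         /* [2:0] NBM_OFF_B  */
}
```

For example, `pack_meta1(false, true, false, 3, 0)` selects Packed Mode with RNE rounding and saturation.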

Cycle 1: Scale A and Config A

| Port | Name | Description |
|---|---|---|
| ui_in[7:0] | SCALE_A | 8-bit unsigned biased exponent (UE8M0, Bias 127). |
| uio_in[7:3] | BM_IDX_A | Block Max Index (0-31) for Operand A. |
| uio_in[2:0] | FORMAT_A | 0: E4M3, 1: E5M2, 2: E3M2, 3: E2M3, 4: E2M1, 5: INT8. |
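The SCALE_A byte encodes a pure power of two, \(2^{E-127}\). The sketch below converts it to a float for use in a software reference model. The function names are hypothetical. The OCP MX specification reserves the code 0xFF as NaN; whether this core handles that code the same way is not stated here, so the check is an assumption.

```c
#include <math.h>   /* NAN macro only; no libm functions are called */
#include <stdint.h>

/* Compute 2^n by repeated multiplication (keeps the example free of libm calls). */
static float pow2i(int n)
{
    float v = 1.0f;
    while (n > 0) { v *= 2.0f; n--; }
    while (n < 0) { v *= 0.5f; n++; }
    return v;
}

/* Convert a UE8M0 shared scale (bias 127) to its real value 2^(e - 127). */
static float ue8m0_to_float(uint8_t e)
{
    if (e == 0xFFu) {
        return NAN; /* assumption: 0xFF treated as NaN, as in the OCP MX spec */
    }
    return pow2i((int)e - 127);
}
```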

Cycle 2: Scale B and Config B

| Port | Name | Description |
|---|---|---|
| ui_in[7:0] | SCALE_B | 8-bit unsigned biased exponent (UE8M0, Bias 127). |
| uio_in[7:3] | BM_IDX_B | Block Max Index (0-31) for Operand B. |
| uio_in[2:0] | FORMAT_B | Independent format for Operand B. |
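The uio_in byte sent in Cycles 1 and 2 combines the Block Max index with the format code. A minimal sketch of that packing follows. The enum values are taken from the FORMAT_A encoding listed for Cycle 1 and are assumed to apply to FORMAT_B as well; the identifiers themselves are hypothetical.

```c
#include <stdint.h>

/* Format codes as listed for FORMAT_A (assumed identical for FORMAT_B). */
enum mx_format {
    MX_FMT_E4M3 = 0, MX_FMT_E5M2 = 1, MX_FMT_E3M2 = 2,
    MX_FMT_E2M3 = 3, MX_FMT_E2M1 = 4, MX_FMT_INT8 = 5
};

/* Build the uio_in byte for Cycle 1 or 2:
 * [7:3] = Block Max index (0-31), [2:0] = format code. */
static uint8_t pack_cfg(uint8_t bm_idx, enum mx_format fmt)
{
    return (uint8_t)(((bm_idx & 0x1Fu) << 3) | ((unsigned)fmt & 0x7u));
}
```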

6.4 Format Support and Packing

The unit supports a wide range of OCP MX compliant formats. During the STREAM phase (Cycles 3-34), elements are presented on ui_in and uio_in.

| Format | \(E_{bits}\) | \(M_{bits}\) | Bias | Bit Layout |
|---|---|---|---|---|
| E4M3 | 4 | 3 | 7 | S[7] E[6:3] M[2:0] |
| E5M2 | 5 | 2 | 15 | S[7] E[6:2] M[1:0] |
| E3M2 | 3 | 2 | 3 | S[5] E[4:2] M[1:0] (Bits [7:6] ignored) |
| E2M3 | 2 | 3 | 1 | S[5] E[4:3] M[2:0] (Bits [7:6] ignored) |
| E2M1 | 2 | 1 | 1 | S[3] E[2:1] M[0] (Bits [7:4] ignored) |
| INT8 | - | - | - | S[7] V[6:0] (Two’s complement) |
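The floating-point rows of this table can be decoded by one generic routine parameterized on exponent width, mantissa width, and bias. The sketch below is a software reference decoder, not a description of the hardware datapath. It handles normal and subnormal values only: special encodings (for example E4M3 NaN or E5M2 Inf/NaN) are decoded as ordinary numbers, and INT8 is not covered. All identifiers are hypothetical.

```c
#include <stdint.h>

/* Format descriptor: field widths and exponent bias from the table above. */
typedef struct { unsigned ebits, mbits; int bias; } mx_fmt_t;

static const mx_fmt_t MX_E4M3 = {4, 3, 7};
static const mx_fmt_t MX_E5M2 = {5, 2, 15};
static const mx_fmt_t MX_E3M2 = {3, 2, 3};
static const mx_fmt_t MX_E2M3 = {2, 3, 1};
static const mx_fmt_t MX_E2M1 = {2, 1, 1};

/* Compute 2^n by repeated multiplication (no libm dependency). */
static float pow2i(int n)
{
    float v = 1.0f;
    while (n > 0) { v *= 2.0f; n--; }
    while (n < 0) { v *= 0.5f; n++; }
    return v;
}

/* Decode one element. Bits above the sign bit are ignored, matching the
 * "Bits [7:6] ignored" / "Bits [7:4] ignored" notes in the table. */
static float mx_decode(uint8_t raw, mx_fmt_t f)
{
    unsigned m = raw & ((1u << f.mbits) - 1u);
    unsigned e = (raw >> f.mbits) & ((1u << f.ebits) - 1u);
    unsigned s = (raw >> (f.ebits + f.mbits)) & 1u;
    float frac = (float)m / (float)(1u << f.mbits);

    /* e == 0: subnormal, no implicit leading one, exponent fixed at 1 - bias. */
    float mag = (e == 0u) ? pow2i(1 - f.bias) * frac
                          : pow2i((int)e - f.bias) * (1.0f + frac);
    return s ? -mag : mag;
}
```

For instance, the E2M1 code 0x7 decodes to 6.0, the largest FP4 magnitude, and the E4M3 code 0x38 decodes to 1.0.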

Packed FP4 (FP4/Dual)

When PACKED_EN=1 (Metadata 1) and both formats are FP4 (E2M1), each input byte carries two elements, so the unit processes two element pairs per cycle (one per lane).

| Bit | Name | Description |
|---|---|---|
| [7:4] | Element i+1 | High nibble contains the next element in the sequence. |
| [3:0] | Element i | Low nibble contains the current element. |
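On the host side, a 32-element FP4 block has to be packed into 16 bytes before streaming. The following sketch follows the nibble order in the table above (element i in the low nibble, element i+1 in the high nibble). The function names are hypothetical.

```c
#include <stdint.h>

/* Pack two 4-bit E2M1 codes into one stream byte. */
static uint8_t fp4_pack(uint8_t elem_i, uint8_t elem_i_plus_1)
{
    return (uint8_t)(((elem_i_plus_1 & 0x0Fu) << 4) | (elem_i & 0x0Fu));
}

/* Pack a 32-element block (one 4-bit code per input byte) into 16 bytes. */
static void fp4_pack_block(const uint8_t elems[32], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        out[i] = fp4_pack(elems[2 * i], elems[2 * i + 1]);
    }
}
```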

6.5 Debug Capabilities

The unit includes integrated logic analyzer probes for real-time silicon monitoring, enabled via DEBUG_EN=1 at Cycle 0.

| Selector (uio_in[3:0] at Cycle 0) | Signal Description | Bit Mapping |
|---|---|---|
| 0x1 | FSM State | [7:6] State, [5:0] logical_cycle |
| 0x2 | Exceptions | [7] nan_sticky, [6] inf_pos, [5] inf_neg, [4] strobe |
| 0x3-0x6 | Accumulator | Live 32-bit accumulator (Byte-wise) |
| 0x7-0x8 | Multiplier L0 | Lane 0 product (MSB/LSB) |
| 0x9 | Control | ENA, Strobe, Acc_En, Acc_Clear |
| 0xA | L0 Metadata | [7] sign, [6] nan, [5] inf, [4:0] exp_sum |
| 0xB-0xC | Multiplier L1 | Lane 1 product (MSB/LSB) |
| 0xD | L1 Metadata | [7] sign, [6] nan, [5] inf, [4:0] exp_sum |
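Probe bytes read from uo_out can be unpacked on the host using the bit mappings above. The sketch below decodes the FSM State (0x1) and Exceptions (0x2) probes; the struct and function names are hypothetical, and the numeric encoding of the 2-bit state field is not specified here, so it is returned as a raw value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Probe 0x1: [7:6] State, [5:0] logical_cycle. */
typedef struct { uint8_t state; uint8_t logical_cycle; } fsm_probe_t;

static fsm_probe_t decode_fsm_probe(uint8_t b)
{
    fsm_probe_t p;
    p.state = (uint8_t)(b >> 6);            /* raw 2-bit state code */
    p.logical_cycle = (uint8_t)(b & 0x3Fu); /* 0-63 */
    return p;
}

/* Probe 0x2: [7] nan_sticky, [6] inf_pos, [5] inf_neg, [4] strobe. */
typedef struct { bool nan_sticky, inf_pos, inf_neg, strobe; } exc_probe_t;

static exc_probe_t decode_exc_probe(uint8_t b)
{
    exc_probe_t p;
    p.nan_sticky = (b & 0x80u) != 0u;
    p.inf_pos    = (b & 0x40u) != 0u;
    p.inf_neg    = (b & 0x20u) != 0u;
    p.strobe     = (b & 0x10u) != 0u;
    return p;
}
```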

6.6 FP4 Fast Mode

The unit provides a high-throughput FP4 Fast Lane mode by combining Vector Packing and the Short Protocol.

In this mode:

  • Vector Packing (uio_in[6]=1 at Cycle 0) enables dual-lane processing, reducing the STREAM phase from 32 to 16 cycles.

  • Short Protocol (ui_in[7]=1 at Cycle 0) bypasses the Scale/Format load cycles (Cycles 1-2).

  • The total block latency is reduced from 41 cycles to 23 cycles (1 Config + 16 Stream + 6 Flush/Output).
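The cycle budget above can be written as a small helper, which is useful for estimating throughput in a host scheduler. This is a sketch: the function name is hypothetical, and the values for the two mixed cases (packing without the Short Protocol, or the reverse) are derived from the same breakdown rather than stated in this document.

```c
#include <stdbool.h>

/* Total cycles for one 32-element block:
 * 1 config + 2 scale/format loads (skipped by Short Protocol)
 * + 32 or 16 stream cycles + 6 tail cycles (FLUSH, CALC, 4 x OUTPUT). */
static unsigned mac_block_cycles(bool packed, bool short_prot)
{
    const unsigned config = 1u;
    const unsigned load = short_prot ? 0u : 2u;
    const unsigned stream = packed ? 16u : 32u;
    const unsigned tail = 6u;
    return config + load + stream + tail;
}
```

With both options enabled the helper returns 23, and with neither it returns 41, matching the figures quoted above.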

7. Application and Implementation

7.1 Typical Application Circuit

The MAC unit is typically interfaced with a host MCU (e.g., RP2040 or a RISC-V core like SERV) via the 8-bit ui_in and uio_in buses.

7.2 Firmware Example (C-style)

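#include <stdint.h>

// Board-support helpers, assumed to be provided by the host firmware:
// tt_write() drives ui_in/uio_in for the given protocol cycle and pulses clk;
// tt_read_result() clocks through FLUSH/CALC and collects the four result bytes.
void tt_write(int cycle, uint8_t ui_in, uint8_t uio_in);
uint32_t tt_read_result(void);

// Perform a single MX block operation (32 elements) and return the 32-bit result
uint32_t run_mac_block(const uint8_t *a, const uint8_t *b, uint8_t scale_a, uint8_t scale_b) {
    // Cycle 0: Metadata 0 = Metadata 1 = 0x00 (normal multiplier, unpacked, TRN rounding, saturate)
    tt_write(0, 0x00, 0x00);
    // Cycles 1-2: load scales; uio_in = 0x00 selects Block Max index 0 and format E4M3
    tt_write(1, scale_a, 0x00);
    tt_write(2, scale_b, 0x00);
    // Cycles 3-34: stream elements
    for (int i = 0; i < 32; i++) {
        tt_write(3 + i, a[i], b[i]);
    }
    // Result is output on Cycles 37-40
    return tt_read_result();
}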
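The firmware example relies on a helper that gathers the four result bytes seen on uo_out during Cycles 37-40. One possible way to assemble them is sketched below. The byte order is an assumption (most significant byte first, on Cycle 37), since it is not specified in this document; the function name is hypothetical. The result is returned as a signed value because the accumulator is a 32-bit signed fixed-point register.

```c
#include <stdint.h>

/* Assemble the serialized result.
 * bytes[0] is the byte captured on Cycle 37, bytes[3] the one on Cycle 40.
 * ASSUMPTION: MSB-first ordering; swap the shifts if the core is LSB-first. */
static int32_t assemble_result(const uint8_t bytes[4])
{
    uint32_t u = ((uint32_t)bytes[0] << 24) |
                 ((uint32_t)bytes[1] << 16) |
                 ((uint32_t)bytes[2] << 8)  |
                 (uint32_t)bytes[3];
    /* Reinterpret as two's-complement signed (as on mainstream compilers). */
    return (int32_t)u;
}
```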

8. Package and Ordering Information

The unit is delivered as a hard macro within the Tiny Tapeout 2x2 tile framework.

| Part Number | Description | Gate Count | Tile Size |
|---|---|---|---|
| TT-MXFP8-F | Full Edition (Dual-Lane, MX+) | ~6800 | 2x2 |
| TT-MXFP8-L | Lite Edition (Single-Lane) | ~4000 | 1x1 |
| TT-MXFP8-T | Tiny Edition (Minimal) | ~2200 | 1x1 |


9. Revision History

| Revision | Date | Description |
|---|---|---|
| 1.0 | 2024-05 | Initial release for Tiny Tapeout. |
| 1.1 | 2025-03 | Expanded to comprehensive reference manual. |


Appendix: Mathematics

OCP MX+ (Extended Mantissa)

MX+ leverages the redundancy in the Block Max element (where the exponent is always \(E_{max}\)) to provide additional mantissa precision:

$$V(A_{BM}) = S \cdot 2^{E_{max} - \text{Bias}} \cdot \left(1 + \frac{\text{concat}(E_i, M_i)}{2^{E_{bits} + M_{bits}}}\right) \cdot 2^{X_A - 127}$$
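A direct software evaluation of this expression is sketched below for use in a reference model. It assumes \(S\) is the sign, \(X_A\) is the UE8M0 shared scale for Operand A, and \(E_{max}\) is passed in as the biased maximum exponent of the element format. The function and parameter names are hypothetical.

```c
#include <stdint.h>

/* Compute 2^n by repeated multiplication (no libm dependency). */
static float pow2i(int n)
{
    float v = 1.0f;
    while (n > 0) { v *= 2.0f; n--; }
    while (n < 0) { v *= 0.5f; n++; }
    return v;
}

/* Value of the Block Max element under MX+.
 * The exponent field is repurposed as extra mantissa bits, so the
 * effective mantissa is concat(E_i, M_i) with ebits + mbits bits. */
static float mxplus_bm_value(unsigned sign, unsigned e_field, unsigned m_field,
                             unsigned ebits, unsigned mbits,
                             int e_max, int bias, uint8_t scale)
{
    unsigned ext = (e_field << mbits) | m_field;             /* concat(E_i, M_i) */
    float frac = (float)ext / (float)(1u << (ebits + mbits));
    float mag = pow2i(e_max - bias) * (1.0f + frac) * pow2i((int)scale - 127);
    return sign ? -mag : mag;
}
```

For E2M1 (2 exponent bits, 1 mantissa bit, bias 1, biased maximum exponent 3), the three repurposed bits give eight mantissa steps between 4.0 and 8.0 at unit scale, instead of the two available in the plain format.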

Mitchell’s Approximation (LNS Mode)

In LNS mode, multiplication is simplified to the addition of logarithmic representations:

$$\log_2(1+M) \approx M$$

$$\log_2(A \times B) \approx (E_A + M_A) + (E_B + M_B) - 2 \times \text{Bias}$$
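The approximation can be illustrated with a short software model. The sketch below multiplies the magnitudes of two normal (non-zero, non-subnormal) operands that share one format: it adds exponents and mantissa fields, carries mantissa overflow into the exponent, and applies the inverse approximation \(2^{k+f} \approx 2^k (1+f)\). It is an illustration of Mitchell's method, not the core's RTL, and the names are hypothetical.

```c
/* Compute 2^n by repeated multiplication (no libm dependency). */
static float pow2i(int n)
{
    float v = 1.0f;
    while (n > 0) { v *= 2.0f; n--; }
    while (n < 0) { v *= 0.5f; n++; }
    return v;
}

/* Mitchell-approximate product magnitude of two normal operands.
 * e_*, m_* are the raw exponent and mantissa fields; mbits and bias
 * describe the shared element format. Signs are handled separately. */
static float mitchell_mul(unsigned e_a, unsigned m_a,
                          unsigned e_b, unsigned m_b,
                          unsigned mbits, int bias)
{
    const unsigned one = 1u << mbits;
    unsigned m_sum = m_a + m_b;               /* M_A + M_B in units of 2^-mbits */
    unsigned carry = (m_sum >= one) ? 1u : 0u;
    int k = (int)e_a + (int)e_b - 2 * bias + (int)carry; /* integer part of log2 */
    unsigned f = m_sum - carry * one;         /* fractional part of log2 */

    /* Antilog: 2^(k + f) ~= 2^k * (1 + f). */
    return pow2i(k) * (1.0f + (float)f / (float)one);
}
```

In E4M3, 1.5 x 1.5 yields 2.0 under this model instead of the exact 2.25, which shows the underestimate that occurs when both mantissas are non-zero.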

Thank you!

Special thanks to the Tiny Tapeout and IHP communities for supporting open-source silicon development.