OCP MXFP8 Streaming MAC Unit
High-Performance AI Inference Accelerator with Shared Scaling
1. General Description
The OCP MXFP8 Streaming MAC Unit is a high-performance, area-optimized arithmetic core designed for next-generation AI inference acceleration. Fully compliant with the OpenCompute (OCP) Microscaling Formats (MX) Specification v1.0, the unit supports a comprehensive suite of sub-8-bit floating-point and integer formats.
Featuring hardware-accelerated shared scaling and an area-efficient logarithmic multiplier path, the core is optimized for deployment in resource-constrained edge devices and large-scale AI accelerators alike. The “Full” edition provides a 2x2 tile configuration (Tiny Tapeout) with dual-lane processing capabilities.
2. Features
OCP MX Compliance: Full support for OCP Microscaling Formats v1.0.
Multi-Format Support:
FP8 (E4M3, E5M2)
FP6 (E3M2, E2M3)
FP4 (E2M1)
INT8 / INT8_SYM
High-Precision Datapath:
40-bit internal aligner for high-dynamic range support.
32-bit signed fixed-point accumulator.
Vector Packing: 2x throughput for 4-bit formats (FP4) via dual-lane streaming.
Hardware Scaling: Automatic UE8M0 shared exponent application (\(2^{E-127}\)).
Flexible Rounding: Supports RNE (Round-to-Nearest-Even), TRN, CEL, and FLR.
MX+ Extensions: Extended mantissa for “Block Max” outliers to improve accuracy.
LNS Mode: Integrated Mitchell’s Approximation for area-optimized multiplication.
Logic Analyzer Mode: 14 selectable internal probes for real-time silicon monitoring.
Mixed Precision: Independent format selection for Operand A and Operand B.
3. Applications
Mobile and Edge AI Inference.
Deep Learning Accelerator (DLA) sub-modules.
Convolutional Neural Network (CNN) hardware acceleration.
Quantized Large Language Model (LLM) execution.
Low-power DSP for IoT and wearables.
4. Functional Block Diagram
The unit follows a pipelined architecture consisting of a dual-lane multiplier, a high-dynamic range aligner, and a 32-bit accumulator.
5. Pin Configuration and Functions
The unit utilizes an 8-bit streaming interface to minimize pin count while maintaining high throughput.
Pin |
Name |
Type |
Description |
|---|---|---|---|
|
DATA_A |
I |
Operand A elements, Scale A, or Metadata 0. |
|
DATA_B |
I |
Operand B elements, Scale B, or Metadata 1. |
|
RESULT |
O |
Serialized 32-bit result or Debug Probe data. |
|
RESERVED |
O |
Driven to 0x00 (Configured as Inputs via |
|
CLK |
I |
System Clock (Target: 20MHz). |
|
RESET_N |
I |
Active-low asynchronous reset. |
|
ENA |
I |
Clock Enable. |
6. Detailed Description
6.1 Operational Modes
The MAC unit supports several specialized modes to balance throughput, area, and precision.
Standard Mode: Single-lane processing for 8-bit, 6-bit, and 4-bit formats over 41 cycles.
Packed Mode: Dual-lane processing for 4-bit formats (FP4), doubling throughput by reducing the stream phase to 16 cycles.
LNS Mode: Replaces the standard multiplier with a logarithmic adder using Mitchell’s Approximation, reducing area by ~50% in the multiplier core.
Debug Mode: Allows internal probing of the accumulator, FSM states, and multiplier results via the
uo_outport.
6.2 Streaming Protocol
The unit operates using a 41-cycle streaming protocol to process a block of 32 elements (\(k=32\)).
Cycle |
Input |
Input |
Output |
Phase |
|---|---|---|---|---|
0 |
Metadata 0 |
Metadata 1 |
0x00 / Probe |
IDLE / CONFIG |
1 |
Scale A |
Format A / BM A |
0x00 / Probe |
LOAD_CFG_A |
2 |
Scale B |
Format B / BM B |
0x00 / Probe |
LOAD_CFG_B |
3-34 |
Element \(A_i\) |
Element \(B_i\) |
0x00 / Probe |
STREAM |
35 |
- |
- |
Meta Echo |
FLUSH |
36 |
- |
- |
0x00 |
CALC |
37-40 |
- |
- |
Result [31:0] |
OUTPUT |
*Note: In Packed Mode, the STREAM phase is reduced to 16 cycles (Cycles 3-18).
6.3 Register Layouts (Cycle 0-2)
Cycle 0: Metadata 0 (ui_in)
Bit |
Name |
Description |
|---|---|---|
|
SHORT_PROT |
1: Reuse previous scales/formats; jump to Cycle 3. |
|
DEBUG_EN |
1: Enable internal probing and metadata echo. |
|
LOOPBACK_EN |
1: Enable XOR loopback ( |
|
LNS_MODE |
Multiplier mode: |
|
NBM_OFF_A |
Exponent offset for Operand A (MX++). |
Cycle 0: Metadata 1 (uio_in)
Bit |
Name |
Description |
|---|---|---|
|
MX_PLUS_EN |
1: Enable OCP MX+ extended mantissa. |
|
PACKED_EN |
1: Enable Vector Packing (2 elements/byte). |
|
OVFL_WRAP |
0: SAT (Saturate), 1: WRAP. |
|
ROUND_MODE |
|
|
NBM_OFF_B |
Exponent offset for Operand B / Format select. |
Cycle 1: Scale A and Config A
Port |
Name |
Description |
|---|---|---|
|
SCALE_A |
8-bit unsigned biased exponent (UE8M0, Bias 127). |
|
BM_IDX_A |
Block Max Index (0-31) for Operand A. |
|
FORMAT_A |
|
Cycle 2: Scale B and Config B
Port |
Name |
Description |
|---|---|---|
|
SCALE_B |
8-bit unsigned biased exponent (UE8M0, Bias 127). |
|
BM_IDX_B |
Block Max Index (0-31) for Operand B. |
|
FORMAT_B |
Independent format for Operand B. |
6.4 Format Support and Packing
The unit supports a wide range of OCP MX compliant formats. During the STREAM phase (Cycles 3-34), elements are presented on ui_in and uio_in.
Format |
\(E_{bits}\) |
\(M_{bits}\) |
Bias |
Bit Layout |
|---|---|---|---|---|
E4M3 |
4 |
3 |
7 |
|
E5M2 |
5 |
2 |
15 |
|
E3M2 |
3 |
2 |
3 |
|
E2M3 |
2 |
3 |
1 |
|
E2M1 |
2 |
1 |
1 |
|
INT8 |
- |
- |
- |
|
Packed FP4 (FP4/Dual)
When PACKED_EN=1 (Metadata 1) and both formats are FP4 (E2M1), the unit processes two elements per cycle per lane.
Bit |
Name |
Description |
|---|---|---|
|
Element i+1 |
High nibble contains the next element in the sequence. |
|
Element i |
Low nibble contains the current element. |
6.5 Debug Capabilities
The unit includes integrated logic analyzer probes for real-time silicon monitoring, enabled via DEBUG_EN=1 at Cycle 0.
Selector ( |
Signal Description |
Bit Mapping |
|---|---|---|
|
FSM State |
|
|
Exceptions |
|
|
Accumulator |
Live 32-bit accumulator (Byte-wise) |
|
Multiplier L0 |
Lane 0 product (MSB/LSB) |
|
Control |
ENA, Strobe, Acc_En, Acc_Clear |
|
L0 Metadata |
|
|
Multiplier L1 |
Lane 1 product (MSB/LSB) |
|
L1 Metadata |
|
6.6 FP4 Fast Mode
The unit provides a high-throughput FP4 Fast Lane mode by combining Vector Packing and the Short Protocol.
In this mode:
Vector Packing (
uio_in[6]=1at Cycle 0) enables dual-lane processing, reducing the STREAM phase from 32 to 16 cycles.Short Protocol (
ui_in[7]=1at Cycle 0) bypasses the Scale/Format load cycles (Cycles 1-2).The total block latency is reduced from 41 cycles to 23 cycles (1 Config + 16 Stream + 6 Flush/Output).
7. Application and Implementation
7.1 Typical Application Circuit
The MAC unit is typically interfaced with a host MCU (e.g., RP2040 or a RISC-V core like SERV) via the 8-bit ui_in and uio_in buses.
7.2 Firmware Example (C-style)
// Perform a single MX block operation (32 elements)
void run_mac_block(uint8_t* a, uint8_t* b, uint8_t scale_a, uint8_t scale_b) {
// Cycle 0: Configuration
tt_write(0, 0x00, 0x00);
// Cycle 1-2: Load Scales and Formats
tt_write(1, scale_a, 0x00); // Set Scale A and Format A (E4M3)
tt_write(2, scale_b, 0x00); // Set Scale B and Format B (E4M3)
// Cycles 3-34: Stream Elements
for(int i=0; i<32; i++) {
tt_write(3+i, a[i], b[i]);
}
// Result ready at Cycle 37-40
uint32_t result = tt_read_result();
}
8. Package and Ordering Information
The unit is delivered as a hard macro within the Tiny Tapeout 2x2 tile framework.
Part Number |
Description |
Gate Count |
Tile Size |
|---|---|---|---|
TT-MXFP8-F |
Full Edition (Dual-Lane, MX+) |
~6800 |
2x2 |
TT-MXFP8-L |
Lite Edition (Single-Lane) |
~4000 |
1x1 |
TT-MXFP8-T |
Tiny Edition (Minimal) |
~2200 |
1x1 |
9. Revision History
Revision |
Date |
Description |
|---|---|---|
1.0 |
2024-05 |
Initial release for Tiny Tapeout. |
1.1 |
2025-03 |
Expanded to comprehensive reference manual. |
Appendix: Mathematics
OCP MX+ (Extended Mantissa)
MX+ leverages the redundancy in the Block Max element (where the exponent is always \(E_{max}\)) to provide additional mantissa precision: $\(V(A_{BM}) = S \cdot 2^{E_{max} - \text{Bias}} \cdot \left(1 + \frac{\text{concat}(E_i, M_i)}{2^{E_{bits} + M_{bits}}}\right) \cdot 2^{X_A - 127}\)$
Mitchell’s Approximation (LNS Mode)
In LNS mode, multiplication is simplified to the addition of logarithmic representations: $\(\log_2(1+M) \approx M\)\( \)\(\log_2(A \times B) \approx (E_A + M_A) + (E_B + M_B) - 2 \times \text{Bias}\)$
Thank you!
Special thanks to the Tiny Tapeout and IHP communities for supporting open-source silicon development.