### 2025 Workshop on



# Remote Memory ~ Local Memory over Reconfigurable Ethernet Fabric [1]

Vishal Shrivastav

Joint work with Weigao Su



#### Reference:

[1] "EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation". Weigao Su and Vishal Shrivastav. ASPLOS 2025

#### Why memory disaggregation?

- The need for memory is surging
- Constraints of individual servers
- Fine-grained pooling, elastic scaling








#### Why is Ethernet promising?

- Dominant datacenter network fabric
  - Low management cost, distance scaling...
- High bandwidth (Terabit Ethernet link)




### Memory Disaggregation over Ethernet






## However, the latency in Ethernet is prohibitive, prompting proposals for a <u>separate</u> fabric to carry memory traffic

Custom processor interconnect, PCIe, InfiniBand, etc.






But separate fabrics for different kinds of traffic make the network costlier and harder to manage


A low-latency Ethernet fabric would allow a single unified network fabric to carry all kinds of traffic (memory, storage, IP, ...)

... easier to manage, lower cost, statistical bandwidth multiplexing






### Research goal

# Achieving near intra-server memory access latency over rack-scale Ethernet

(while maintaining high bandwidth utilization)



















1. Ethernet MAC enforces a minimum frame size of 64B
   - ... but memory messages can be much smaller (e.g., read requests are typically 8-16B)
2. Ethernet MAC enforces a minimum inter-frame gap (IFG) of 12 bytes
   - ... high overhead for small memory messages
3. Ethernet MAC does not allow intra-frame preemption
   - ... a large non-memory frame may block the transmission of a small memory message






Root cause: MAC layer processing
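To make the first two MAC-layer overheads concrete, the following back-of-the-envelope sketch computes how much link time a small memory message consumes when it is sent as its own Ethernet frame. The 64B minimum frame and 12B IFG come from the points above; the 8B preamble plus start-of-frame delimiter and the 18B of MAC header plus FCS are standard Ethernet figures that the slides do not mention. The function names are hypothetical, and the model optimistically assumes no additional protocol headers (IP, UDP, RDMA).

```python
# Per-frame costs on the wire, in bytes.
PREAMBLE_SFD = 8   # preamble + start-of-frame delimiter (standard Ethernet, not from the slides)
MIN_FRAME = 64     # minimum MAC frame, MAC header and FCS included
MIN_IFG = 12       # minimum inter-frame gap


def wire_bytes(message_bytes: int, header_fcs_bytes: int = 18) -> int:
    """Bytes of link time consumed when one message is sent as its own frame."""
    # Short payloads are padded up to the minimum frame size.
    frame = max(MIN_FRAME, message_bytes + header_fcs_bytes)
    return PREAMBLE_SFD + frame + MIN_IFG


def efficiency(message_bytes: int) -> float:
    """Fraction of the consumed link time that carries the message itself."""
    return message_bytes / wire_bytes(message_bytes)


if __name__ == "__main__":
    for size in (8, 16, 64, 1500):
        print(f"{size:>5} B message -> {wire_bytes(size):>5} B on the wire, "
              f"{efficiency(size):.1%} useful")
```

Under these assumptions, an 8B read request occupies 84B of link time, so under 10% of that time carries the request itself.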

### Design Choice #1:

Implement the entire protocol for remote memory access within Ethernet's Physical Layer (PHY)

#### Architecture of Remote Memory Protocol in the PHY



#### Rationale for Remote Memory Protocol in PHY



### Ethernet PHY already reformats a MAC layer frame into a series of 66-bit PHY blocks

... thus, unlike the MAC layer that works at a <u>frame</u> granularity, PHY works at fine-grained <u>block</u> granularity

- 66-bit PHY block vs. 64-byte minimum MAC frame size
- Message interleaving can be done at block granularity in PHY rather than at frame granularity in MAC
- PHY also has access to IFG blocks
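The difference between frame-granularity and block-granularity interleaving can be illustrated with a toy model of a single output port. This is a simplified sketch, not EDM's implementation: it treats every 66-bit block as 8 payload bytes plus a 2-bit sync header, ignores control blocks, preamble, and IFG, and assumes the large frame has already started transmitting at slot 0. The names `to_blocks` and `transmit` are hypothetical.

```python
from collections import deque

BLOCK_PAYLOAD_BYTES = 8  # a 66-bit block carries 64 payload bits plus a 2-bit sync header


def to_blocks(tag: str, n_bytes: int) -> list[str]:
    """Split a message of n_bytes into labelled block-sized pieces."""
    n_blocks = -(-n_bytes // BLOCK_PAYLOAD_BYTES)  # ceiling division
    return [f"{tag}{i}" for i in range(n_blocks)]


def transmit(frame_blocks, mem_blocks, mem_arrival_slot, block_preemption):
    """Return the order in which blocks leave the port.

    The non-memory frame is already in flight at slot 0; the memory message
    becomes ready at block slot `mem_arrival_slot`.
    """
    frame, mem = deque(frame_blocks), deque(mem_blocks)
    out, slot = [], 0
    while frame or mem:
        mem_ready = bool(mem) and slot >= mem_arrival_slot
        if mem_ready and (block_preemption or not frame):
            out.append(mem.popleft())    # memory block goes out in this slot
        elif frame:
            out.append(frame.popleft())  # keep sending the in-flight frame
        else:
            out.append("idle")           # nothing to send until the memory message arrives
        slot += 1
    return out


if __name__ == "__main__":
    frame = to_blocks("F", 1518)  # a full-size non-memory frame
    mem = to_blocks("M", 16)      # a small memory message
    mac_like = transmit(frame, mem, mem_arrival_slot=3, block_preemption=False)
    phy_like = transmit(frame, mem, mem_arrival_slot=3, block_preemption=True)
    print("frame granularity: first memory block at slot", mac_like.index("M0"))
    print("block granularity: first memory block at slot", phy_like.index("M0"))
```

In this model the memory message leaves at slot 190 with frame-granularity interleaving and at slot 3 with block-granularity interleaving. At 25 Gb/s, where one 8-byte block takes about 2.56 ns, that is a difference of roughly 480 ns.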




#### Remote Memory Protocol in the PHY: What about latency?



### Design Choice #2:

# Packet Switching → Reconfigurable (Circuit) Switching

Using a centralized memory traffic scheduler implemented in the PHY of the Ethernet switch

#### **Central Scheduler in the Switch PHY**














- Challenge 1: Accurate traffic demand estimation
  - Solution: Leverage the interface to the memory controller
- Challenge 2: Send demands to the switch with low bandwidth and latency overhead
  - Solution: Leverage the request-reply nature of memory access
- Challenge 3: Line-rate, low-latency scheduler pipeline
  - Solution: Leverage hardware parallelism in the switch's PHY
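A minimal software sketch of what a centralized per-slot scheduler could look like is given below. It is a generic greedy maximal matching, not the algorithm EDM uses, and every name in it is hypothetical. The `on_read_request` helper encodes one possible reading of "leverage the request-reply nature": a read request passing through the switch already tells the scheduler how much reply traffic to expect, so no separate demand message is needed. The slides do not confirm this detail.

```python
Demand = dict[tuple[int, int], int]  # (src_port, dst_port) -> pending blocks


def on_read_request(demand: Demand, requester: int, memory_node: int,
                    reply_blocks: int) -> None:
    """Record the reply traffic implied by a read request (assumed mechanism)."""
    pair = (memory_node, requester)  # the reply flows from memory node to requester
    demand[pair] = demand.get(pair, 0) + reply_blocks


def schedule_slot(demand: Demand) -> list[tuple[int, int]]:
    """Grant a conflict-free set of (src, dst) pairs for one slot."""
    busy_src, busy_dst = set(), set()
    grants = []
    # Serve the largest backlog first; ties are broken by port numbers.
    for (src, dst), pending in sorted(demand.items(),
                                      key=lambda kv: (-kv[1], kv[0])):
        if pending > 0 and src not in busy_src and dst not in busy_dst:
            grants.append((src, dst))
            busy_src.add(src)
            busy_dst.add(dst)
    for pair in grants:
        demand[pair] -= 1  # one block is sent per granted pair per slot
    return grants


if __name__ == "__main__":
    demand: Demand = {}
    on_read_request(demand, requester=0, memory_node=2, reply_blocks=2)
    on_read_request(demand, requester=1, memory_node=2, reply_blocks=1)
    on_read_request(demand, requester=1, memory_node=3, reply_blocks=1)
    slot = 0
    while any(demand.values()):
        print(f"slot {slot}: {schedule_slot(demand)}")
        slot += 1
```

Each port sends to at most one destination and receives from at most one source per slot, which is the constraint a circuit-style switch configuration imposes. The sequential loop here is only for readability; a line-rate hardware pipeline would evaluate candidate pairs in parallel.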

## Implementation & Evaluation



#### **Hardware Testbed**

- Three Xilinx Alveo U200 FPGAs
- Open-source 25GbE (Corundum)
- Synopsys ASIC RTL compiler

#### **Evaluation Result**

End-to-end unloaded latency




#### **Network Simulator**

- A single rack with 144 nodes
- Fed with real-world traces
- Compare against 6 classes of scheduling / congestion control

#### **Evaluation Result**

Disaggregated workloads in a loaded network

| Experiment name | Dataset                         |
|-----------------|---------------------------------|
| Hadoop, Spark   | Generator @ sortbenchmark.org   |
| Spark SQL       | Big Data Benchmark @ Berkeley   |
| GraphLab        | Movie rating data @ Netflix     |
| Memcached       | KV-store @ YCSB                 |



### Summary

- EDM is a low latency Ethernet fabric for memory disaggregation.
- EDM uses two ideas to achieve low latency with high bandwidth utilization:
  - EDM implements the protocol for remote memory access entirely in the Ethernet PHY.
  - EDM implements a **fast, centralized memory traffic scheduler** in the switch's PHY.
- EDM incurs a latency of ~300ns (7x lower than RoCE) in an unloaded network, and < 1.3x its unloaded latency under heavy network loads.

# Thank you!

Code: [https://github.com/wegul/EDM](https://github.com/wegul/EDM)