# DQEMU: A Scalable Emulator with Retargetable DBT on Distributed Platforms

**Ziyi Zhao**, Zhang Jiang, Ximing Liu, Xiaoli Gong\*

Nankai University

Wenwen Wang

University of Georgia

Pen-Chung Yew

University of Minnesota




# **Dynamic Binary Translation (DBT)**

"A Key Enabling Technology"



# The scalability of DBT is limited by computing resources



# Goal: Enable DBT to utilize compute resources across nodes

In a **distributed** emulator...

- How to maintain *guest* cache coherence?
  - Transparently
- How to emulate *guest* **system calls**?
  - Side effects on the host kernel
- How to emulate *guest* **atomic operations**?
  - Equivalent atomic semantics between RISC $\leftrightarrow$ CISC

# How does DBT work?

| Stage                | Code                                                                                             |
|----------------------|--------------------------------------------------------------------------------------------------|
| Guest Code (AArch64) | ldr x21, [x1, #0x758]                                                                            |
| Intermediate Code    | mov_i64 tmp2,x1<br>movi_i64 tmp3,\$0x758<br>add_i64 tmp2,tmp2,tmp3<br>qemu_ld_i64 x21,tmp2,leq,0 |
| Host Code (x86)      | movq 0x48(%rbp), %rbx<br>addq \$0x758, %rbx<br>movq 0(%rbx), %rbx<br>movq %rbx, 0xe8(%rbp)       |
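The guest-to-intermediate step in the table can be sketched as a tiny translator. This is a minimal illustration, not QEMU's actual front end: the function name `translate_ldr` and the regular expression are hypothetical, it handles only the single `ldr xD, [xN, #imm]` form shown above, and it assumes the final load reads from the address computed into `tmp2`.

```python
import re

# Hypothetical pattern for one addressing form: ldr xD, [xN, #imm]
_LDR = re.compile(r"ldr\s+(x\d+),\s*\[(x\d+),\s*#(0x[0-9a-fA-F]+|\d+)\]")


def translate_ldr(guest_insn: str) -> list:
    """Lower one guest load into TCG-style intermediate ops (as text)."""
    m = _LDR.fullmatch(guest_insn.strip())
    if m is None:
        raise ValueError(f"unsupported instruction: {guest_insn!r}")
    dst, base, imm = m.group(1), m.group(2), int(m.group(3), 0)
    return [
        f"mov_i64 tmp2,{base}",           # copy the base register
        f"movi_i64 tmp3,${imm:#x}",       # materialize the immediate offset
        "add_i64 tmp2,tmp2,tmp3",         # effective address = base + offset
        f"qemu_ld_i64 {dst},tmp2,leq,0",  # 64-bit little-endian guest load
    ]


if __name__ == "__main__":
    for op in translate_ldr("ldr x21, [x1, #0x758]"):
        print(op)
```

A real DBT would then lower each intermediate op to host instructions, as in the last row of the table.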




### What should distributed DBT look like?



# How to keep cache coherence?

For the Distributed Shared Memory Region...

- At what granularity?
  - Cache line size? Page size? Larger?
- How to check privilege?
  - <u>Software-based instrumentation: check on every memory access</u>
  - Hardware-based: MMU host page level check
- Which type of protocol?
  - Distributed / Centralized
  - MSI

# How to keep cache coherence?



- Use the host MMU to perform the state check

| State    | Page Protection |
|----------|-----------------|
| Modified | RW              |
| Shared   | R-              |
| Invalid  | --              |

- Synchronization granularity = 4 KB (host page size)
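The state-to-protection mapping can be illustrated with a small simulation of a centralized MSI directory at page granularity. This is a sketch under stated assumptions, not DQEMU's implementation: the `Directory` class, its method names, and the node labels are hypothetical, and the `--` (no access) protection for Invalid pages is assumed from the usual MSI convention. A real system would apply these protections with host page permissions and react to the resulting page faults.

```python
from enum import Enum

PAGE_SIZE = 4096  # host page size, the synchronization granularity


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


# Page protection per state; "--" for INVALID is an assumption (no access).
PROT = {State.MODIFIED: "rw", State.SHARED: "r-", State.INVALID: "--"}


class Directory:
    """Hypothetical centralized directory: per page, each node's MSI state."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.pages = {}  # page number -> {node: State}

    def _entry(self, page):
        return self.pages.setdefault(
            page, {n: State.INVALID for n in self.nodes})

    def read_fault(self, node, page):
        """A read hit a no-access page: downgrade any owner, share the page."""
        entry = self._entry(page)
        for n, s in entry.items():
            if s is State.MODIFIED:
                entry[n] = State.SHARED
        entry[node] = State.SHARED

    def write_fault(self, node, page):
        """A write hit a non-writable page: invalidate all other copies."""
        entry = self._entry(page)
        for n in entry:
            entry[n] = State.INVALID
        entry[node] = State.MODIFIED

    def protection(self, node, page):
        return PROT[self._entry(page)[node]]


if __name__ == "__main__":
    d = Directory(["node0", "node1"])
    page = 0x2000 // PAGE_SIZE
    d.write_fault("node0", page)
    print(d.protection("node0", page), d.protection("node1", page))
    d.read_fault("node1", page)
    print(d.protection("node0", page), d.protection("node1", page))
```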

# The problem of system calls



- E.g., *fopen*() called by a worker thread on node #2 affects:
  - User-space file descriptor
  - Kernel-space resource manager
- Syscalls also affect the host kernel

# The problem of system calls – Syscall Delegate

#### **Global Syscall**

read, write, openat, open, fstat, close, stat64, lstat64, fstat64, writev, brk, mmap2, mprotect, madvise, munmap, **clone**, vfork, **futex**

#### Local Syscall

gettimeofday, clock\_gettime, exit, nanosleep, ... all the rest
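The global/local split amounts to a routing decision per system call. The sketch below assumes that global syscalls are forwarded to a single designated node (called `master_id` here, a hypothetical name) so that their side effects land in one host kernel, while everything else runs on the node that issued it. The function name `route_syscall` and the return format are illustrative only.

```python
# Global syscalls as listed on the slide; all others are treated as local.
GLOBAL_SYSCALLS = frozenset({
    "read", "write", "openat", "open", "fstat", "close",
    "stat64", "lstat64", "fstat64", "writev", "brk", "mmap2",
    "mprotect", "madvise", "munmap", "clone", "vfork", "futex",
})


def route_syscall(name, node_id, master_id=0):
    """Decide where a guest syscall executes.

    Returns ("delegate", master_id) when the call must be forwarded,
    or ("local", node_id) when it can run on the issuing node.
    """
    if name in GLOBAL_SYSCALLS and node_id != master_id:
        return ("delegate", master_id)
    return ("local", node_id)


if __name__ == "__main__":
    print(route_syscall("openat", node_id=2))        # forwarded
    print(route_syscall("gettimeofday", node_id=2))  # handled locally
```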



• Guest CPU state

# The emulation of atomic operations

|                   | RISC                                      | CISC                   |
|-------------------|-------------------------------------------|------------------------|
| ISAs              | ARM, MIPS...                              | x86                    |
| Atomic primitives | LL (Load-linked) / SC (Store-conditional) | CAS (Compare and Swap) |

Translate?

1. Page privilege
2. Check if the LL addr equals the page-fault addr: if equal, clean the check by MMU entry; otherwise it is a false alarm.

[Figure: page protection changes (rwx, r-x, rwx) around a page fault, and a PST hash table keyed by linked address, with entries holding uint64\_t LL addr and uint64\_t tid.]
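One reading of the LL/SC scheme is sketched below as a pure simulation. It is an interpretation, not the authors' code: the class `LLSCEmulator` and its methods are hypothetical, a Python dict stands in for the PST hash table (linked address to thread ID), and "the page is write-protected" is modeled by checking whether any linked address lies on the page being stored to. A store that faults on the linked address breaks the link, a store elsewhere on the same page is a false alarm, and SC succeeds only while the link is intact.

```python
PAGE_SIZE = 4096


class LLSCEmulator:
    """Hypothetical model of LL/SC emulated with page write protection."""

    def __init__(self):
        self.memory = {}  # guest address -> value
        self.links = {}   # linked address -> tid (stands in for the PST table)

    def _page_protected(self, addr):
        """True if some linked address lives on the same page as addr."""
        page = addr // PAGE_SIZE
        return any(a // PAGE_SIZE == page for a in self.links)

    def load_linked(self, tid, addr):
        """LL: record the link; the page is now treated as write-protected."""
        self.links[addr] = tid
        return self.memory.get(addr, 0)

    def store(self, tid, addr, value):
        """Ordinary store; reports what the page-fault handler decided."""
        outcome = "no-fault"
        if self._page_protected(addr):
            if addr in self.links:
                del self.links[addr]     # fault address equals linked address
                outcome = "link-broken"
            else:
                outcome = "false-alarm"  # same page, different address
        self.memory[addr] = value
        return outcome

    def store_conditional(self, tid, addr, value):
        """SC: succeeds only if this thread's link on addr is still intact."""
        if self.links.get(addr) != tid:
            return False
        del self.links[addr]
        self.memory[addr] = value
        return True
```

In this model a false alarm costs a fault but leaves the link in place, which is why the fault handler compares the faulting address against the linked address before invalidating anything.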



#### **Hierarchical lock**

- Intra-node: consistency model translation [ArMOR]
- Inter-node: MSI coherence protocol (sequential)

# Page Split: The false sharing overhead



- **Probability**: cache line size **64 B** → page size **4096 B**
- **Cost**: cache miss **23 cycles** → network + page fault ≥ **120,000 cycles**
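To put the two bullets in proportion, using only the figures quoted on this slide, the coherence unit grows by a factor of

$$\frac{4096\ \text{B}}{64\ \text{B}} = 64,$$

and the penalty per false-sharing event grows by a factor of

$$\frac{120000\ \text{cycles}}{23\ \text{cycles}} \approx 5217.$$

Because 120,000 cycles is a lower bound, the second ratio is a lower bound as well. As a rough reading, unrelated data is far more likely to share a 4096 B page than a 64 B line, and each such collision costs at least about 5200 times more than a cache miss.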


# Page Split: The false sharing overhead



- Reduces the likelihood of false sharing
- Compatible with the cache coherence protocol
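The slides do not spell out the mechanism, so the following is only one plausible sketch of page splitting: a contended guest page is divided into `parts` equal chunks, and each chunk is remapped onto its own shadow page so that the page-granularity coherence protocol tracks the chunks independently. The function `split_address`, the `shadow_base` parameter, and the choice to keep the in-page offset unchanged are all assumptions made for illustration.

```python
PAGE_SIZE = 4096


def split_address(addr, parts, shadow_base):
    """Map an address on a split page to (chunk index, shadow address).

    Each of the `parts` chunks gets its own shadow page starting at
    `shadow_base`; the offset within the page is preserved.
    """
    if parts <= 0 or PAGE_SIZE % parts != 0:
        raise ValueError("parts must evenly divide the page size")
    chunk_size = PAGE_SIZE // parts
    offset = addr % PAGE_SIZE
    index = offset // chunk_size
    return index, shadow_base + index * PAGE_SIZE + offset


if __name__ == "__main__":
    # Two variables on the same guest page, touched by different nodes.
    a = split_address(0x10000, parts=2, shadow_base=0x800000)
    b = split_address(0x10800, parts=2, shadow_base=0x800000)
    # They now fall on different shadow pages, so a write to one no longer
    # invalidates the page holding the other.
    print(a, b, a[1] // PAGE_SIZE != b[1] // PAGE_SIZE)
```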

# Hint-based thread scheduling: data sharing among nodes











#### **Experiment Setup**



| Item         | Configuration                        |
|--------------|--------------------------------------|
| Network      | TP-Link TL-SG1024DT Gigabit Switch   |
| CPU          | Quad-core Intel i5-6500 @ 3.30 GHz   |
| Memory       | 12 GB                                |
| OS           | Ubuntu 18.04, Linux 4.15.0           |
| Benchmarks   | micro benchmarks, PARSEC-3.0         |
| ISA          | Guest: ARM → Host: X64               |
| QEMU version | QEMU-4.2.0                           |

#### **Memory Access Performance**

Sequential memory access

| Access Type              | Throughput (MB/s) | Latency (µs) |
|--------------------------|-------------------|--------------|
| QEMU Sequential Access   | 173.06            | -            |
| Remote Sequential Access | 7.88              | 410.5        |
| Page Forwarding Enabled  | 108.01            | 83.2         |
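Reading the table as ratios, computed only from the values above: enabling page forwarding raises remote sequential throughput by

$$\frac{108.01}{7.88} \approx 13.7\times,$$

cuts latency by

$$\frac{410.5}{83.2} \approx 4.9\times,$$

and brings remote access to

$$\frac{108.01}{173.06} \approx 0.62$$

of the single-node QEMU throughput.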




#### **Memory Access Performance**



| Access Type              | Throughput (MB/s) |
|--------------------------|------------------|
| QEMU Access of 128 bytes | 20,259           |
| False Sharing of 1 Page  | 2,216            |
| Page Splitting Enabled   | 75,294           |


### **Atomic Operation Performance**



# **Scalability – Ideal**



# **Scalability – Parallel Programs**







# Scalability – Heavy data sharing program



# Discussion

- A more scalable coherence protocol?
- Random memory access hurts DSM.
- What kinds of programs suit DQEMU? How can they be recognized?
- Support for various host ISAs → heterogeneous computing?

Thank you! Q&A