[v1] Emulate guest vector operations with host vector operations

[Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations

Posted by Kirill Batuzov 9 years ago

The goal of this patch series is to set up infrastructure to emulate
guest vector operations using host vector operations. Preliminary
experiments show that simply translating loads and stores increases the
performance of the x264 video codec by 10%. The performance of a
gcc-vectorized for loop increased 2x.

To be able to emulate guest vector operations using host vector operations,
several things need to be done.

1. Corresponding vector types should be added to TCG. This series adds
TCG_v128 and TCG_v64. I've made TCG_v64 a different type from TCG_i64
because it usually needs to be allocated to different registers and
supports different operations.

2. Load/store operations for these new types need to be implemented.

3. For a seamless transition from the current model to the new one we need
to handle cases where memory occupied by a global variable can also be
accessed via a pointer to the CPUArchState structure. A very simple
conservative alias analysis has been added to do this. The analysis tracks
memory loads and stores that overlap with fields of CPUArchState and
provides this information to the register allocator. The allocator then
spills and reloads the affected globals when needed.

4. Allow overlapping globals. For scalar registers this is a rare case, and
overlapping registers can be handled as a single one (ah, al, ax, eax,
rax). In ARM every Q-register consists of two D-registers, each consisting
of two S-registers. Handling 4 S-registers as one because they are parts of
the same Q-register is way too inefficient.

5. Add a new memory addressing mode to the MMU code for large accesses and
create the needed helpers. Only 128-bit vectors have been handled for now.

6. Create TCG opcodes for vector operations. Only addition has been handled
in this series. Each operation has a wrapper that checks whether the
backend supports the corresponding operation. If it does, the vector opcode
is generated; otherwise the operation is emulated with scalar operations.
The emulation code is generated inline for performance reasons (there is a
huge performance difference between inline generation and calling a
helper). As a positive side effect this will eventually allow merging
similar emulation code for vector instructions from different frontends
into a target-independent implementation.

7. Use new operations in the frontend (ARM was used in this series).

8. Support new operations in the backend (x86_64 was used in this series).

For experiments I have used an ARM guest on an x86_64 host. I wanted a pair
of different architectures that both have vector extensions, and ARM and
x86_64 fit well.

v1 -> v2:
 - represent v128 type with smaller types when it is not supported by the host
 - detect AVX support and use AVX instructions when available
 - tcg/README updated
 - generate two v64 adds instead of one v128 when applicable
 - rebased to newer master
 - overlap detection for temps added (it needs to be explicitly called from
   <arch>_translate_init)
 - the stack is used to temporarily store 128-bit variables in memory
   (instead of the TCGContext field)

v2 -> v2.1:
 - automatic build failure fixed

Outstanding issues:
 - qemu_ld_v128 and qemu_st_v128 do not generate fallback code if the host
   does not support 128-bit registers. The reason is that I do not know how
   to handle differing host and guest endianness (do we swap only the bytes
   within elements, or whole vectors?). Different targets seem to have
   different ideas on how this should be done.

Kirill Batuzov (21):
  tcg: add support for 128bit vector type
  tcg: add support for 64bit vector type
  tcg: support representing vector type with smaller vector or scalar
    types
  tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes
  tcg: add simple alias analysis
  tcg: use results of alias analysis in liveness analysis
  tcg: allow globals to overlap
  tcg: add vector addition operations
  target/arm: support access to vector guest registers as globals
  target/arm: use vector opcode to handle vadd.<size> instruction
  tcg/i386: add support for vector opcodes
  tcg/i386: support 64-bit vector operations
  tcg/i386: support remaining vector addition operations
  tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend
  target/aarch64: do not check for non-existent TCGMemOp
  tcg: introduce new TCGMemOp - MO_128
  tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes
  softmmu: create helpers for vector loads
  tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops
  target/arm: load two consecutive 64-bits vector regs as a 128-bit
    vector reg
  tcg/README: update README to include information about vector opcodes

 cputlb.c                     |   4 +
 softmmu_template_vector.h    | 266 +++++++++++++++++++++++++++++++
 target/arm/translate-a64.c   |   1 -
 target/arm/translate.c       |  76 ++++++++-
 tcg/README                   |  47 +++++-
 tcg/aarch64/tcg-target.inc.c |   4 +-
 tcg/arm/tcg-target.inc.c     |   4 +-
 tcg/i386/tcg-target.h        |  45 +++++-
 tcg/i386/tcg-target.inc.c    | 260 +++++++++++++++++++++++++++++--
 tcg/mips/tcg-target.inc.c    |   4 +-
 tcg/optimize.c               | 165 +++++++++++++++++++-
 tcg/ppc/tcg-target.inc.c     |   4 +-
 tcg/s390/tcg-target.inc.c    |   4 +-
 tcg/sparc/tcg-target.inc.c   |  12 +-
 tcg/tcg-op.c                 |  92 ++++++++++-
 tcg/tcg-op.h                 | 267 +++++++++++++++++++++++++++++++
 tcg/tcg-opc.h                |  34 ++++
 tcg/tcg.c                    | 363 +++++++++++++++++++++++++++++++++++++------
 tcg/tcg.h                    | 163 ++++++++++++++++++-
 19 files changed, 1722 insertions(+), 93 deletions(-)
 create mode 100644 softmmu_template_vector.h

-- 
2.1.4

Re: [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations

Posted by no-reply@patchew.org 9 years ago

Hi,

Your series seems to have some coding style problems. See output below for
more information:

Type: series
Subject: [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations
Message-id: 1486046099-17726-1-git-send-email-batuzovk@ispras.ru

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/1486046099-17726-1-git-send-email-batuzovk@ispras.ru -> patchew/1486046099-17726-1-git-send-email-batuzovk@ispras.ru
 * [new tag]         patchew/1486046738-26059-1-git-send-email-abologna@redhat.com -> patchew/1486046738-26059-1-git-send-email-abologna@redhat.com
Switched to a new branch 'test'
64bbc76 tcg/README: update README to include information about vector opcodes
06bc776 target/arm: load two consecutive 64-bits vector regs as a 128-bit vector reg
164b1f6 tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops
c227f30 softmmu: create helpers for vector loads
084c6df tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes
a9ef8cf tcg: introduce new TCGMemOp - MO_128
723589b target/aarch64: do not check for non-existent TCGMemOp
1b57606 tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend
78bb60f tcg/i386: support remaining vector addition operations
a789efe tcg/i386: support 64-bit vector operations
7c67ff1 tcg/i386: add support for vector opcodes
183aaf5 target/arm: use vector opcode to handle vadd.<size> instruction
565699d target/arm: support access to vector guest registers as globals
777b055 tcg: add vector addition operations
2d56597 tcg: allow globals to overlap
188d844 tcg: use results of alias analysis in liveness analysis
8a0b599 tcg: add simple alias analysis
c8e50bc tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes
6211ed3 tcg: support representing vector type with smaller vector or scalar types
98f37fb tcg: add support for 64bit vector type
8928fcf tcg: add support for 128bit vector type

=== OUTPUT BEGIN ===
Checking PATCH 1/21: tcg: add support for 128bit vector type...
Checking PATCH 2/21: tcg: add support for 64bit vector type...
Checking PATCH 3/21: tcg: support representing vector type with smaller vector or scalar types...
Checking PATCH 4/21: tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes...
Checking PATCH 5/21: tcg: add simple alias analysis...
ERROR: spaces required around that ':' (ctx:VxE)
#81: FILE: tcg/optimize.c:1472:
+        CASE_OP_32_64(movi):
                            ^

ERROR: spaces required around that ':' (ctx:VxE)
#85: FILE: tcg/optimize.c:1476:
+        CASE_OP_32_64(mov):
                           ^

ERROR: spaces required around that ':' (ctx:VxE)
#90: FILE: tcg/optimize.c:1481:
+        CASE_OP_32_64(add):
                           ^

ERROR: spaces required around that ':' (ctx:VxE)
#91: FILE: tcg/optimize.c:1482:
+        CASE_OP_32_64(sub):
                           ^

ERROR: spaces required around that ':' (ctx:VxE)
#101: FILE: tcg/optimize.c:1492:
+        CASE_OP_32_64(ld8s):
                            ^

ERROR: spaces required around that ':' (ctx:VxE)
#102: FILE: tcg/optimize.c:1493:
+        CASE_OP_32_64(ld8u):
                            ^

ERROR: spaces required around that ':' (ctx:VxE)
#106: FILE: tcg/optimize.c:1497:
+        CASE_OP_32_64(ld16s):
                             ^

ERROR: spaces required around that ':' (ctx:VxE)
#107: FILE: tcg/optimize.c:1498:
+        CASE_OP_32_64(ld16u):
                             ^

ERROR: spaces required around that ':' (ctx:VxE)
#125: FILE: tcg/optimize.c:1516:
+        CASE_OP_32_64(st8):
                           ^

ERROR: spaces required around that ':' (ctx:VxE)
#129: FILE: tcg/optimize.c:1520:
+        CASE_OP_32_64(st16):
                            ^

total: 10 errors, 0 warnings, 196 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 6/21: tcg: use results of alias analysis in liveness analysis...
Checking PATCH 7/21: tcg: allow globals to overlap...
Checking PATCH 8/21: tcg: add vector addition operations...
Checking PATCH 9/21: target/arm: support access to vector guest registers as globals...
ERROR: that open brace { should be on the previous line
#38: FILE: target/arm/translate.c:82:
+static const char *regnames_q[] =
+    { "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7",

ERROR: that open brace { should be on the previous line
#42: FILE: target/arm/translate.c:86:
+static const char *regnames_d[] =
+    { "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",

total: 2 errors, 0 warnings, 52 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 10/21: target/arm: use vector opcode to handle vadd.<size> instruction...
Checking PATCH 11/21: tcg/i386: add support for vector opcodes...
Checking PATCH 12/21: tcg/i386: support 64-bit vector operations...
Checking PATCH 13/21: tcg/i386: support remaining vector addition operations...
ERROR: spaces required around that ':' (ctx:VxE)
#102: FILE: tcg/i386/tcg-target.inc.c:2404:
+    OP_V128_ALL(add):
                     ^

ERROR: spaces required around that ':' (ctx:VxE)
#103: FILE: tcg/i386/tcg-target.inc.c:2405:
+    OP_V64_ALL(add):
                    ^

total: 2 errors, 0 warnings, 121 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 14/21: tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend...
Checking PATCH 15/21: target/aarch64: do not check for non-existent TCGMemOp...
Checking PATCH 16/21: tcg: introduce new TCGMemOp - MO_128...
Checking PATCH 17/21: tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes...
Checking PATCH 18/21: softmmu: create helpers for vector loads...
Checking PATCH 19/21: tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops...
Checking PATCH 20/21: target/arm: load two consecutive 64-bits vector regs as a 128-bit vector reg...
Checking PATCH 21/21: tcg/README: update README to include information about vector opcodes...
=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org

Re: [Qemu-devel] [PATCH v2.1 00/20] Emulate guest vector operations with host vector operations

Posted by Kirill Batuzov 8 years, 11 months ago

On Thu, 2 Feb 2017, Kirill Batuzov wrote:

> [...]

Ping?

-- 
Kirill