[PATCH 00/27] tcg: Simplify temporary usage

Richard Henderson posted 27 patches 1 year, 2 months ago
There is a newer version of this series
[PATCH 00/27] tcg: Simplify temporary usage
Posted by Richard Henderson 1 year, 2 months ago
Based-on: 20230126043824.54819-1-richard.henderson@linaro.org
("[PATCH v5 00/36] tcg: Support for Int128 with helpers")

The biggest pitfall for new users of TCG is the fact that "normal"
temporaries die at branches, and we must therefore use a different
"local" temporary in that case.

The following patch set changes that, so that the "normal" temporary
is the one that lives across branches, and there is a special temporary
that dies at the end of the extended basic block, and this special
case is reserved for tcg internals.

TEMP_LOCAL is renamed TEMP_TB, which I believe to be more explicit and
less confusing.  TEMP_NORMAL is removed entirely.

I thought about putting in a proper full-power liveness analysis pass.
This would have eliminated the differences between all non-global
temporaries, and would have noticed when TEMP_LOCAL finally dies
within a translation and avoided any final writeback.
But I came to the conclusion that it was too expensive in runtime,
and so retaining some distinction in the types was required.

In addition, I found that the usage of temps within plugin-gen.c
(9 per guest memory operation) meant that we *must* have some form
of temp that can be re-used.  (There is one x86 instruction which
generates 62 memory operations; 62 * 9 == 558, which is larger than
our current TCG_MAX_TEMPS.)
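
The arithmetic above can be checked with a small sketch. The figures 9 and 62 come from the text; the limit of 512 is an assumption about the value of TCG_MAX_TEMPS at the time, and the helper names are hypothetical, not QEMU functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: TCG_MAX_TEMPS was 512 when this series was posted. */
#define TCG_MAX_TEMPS_ASSUMED 512

/* Figures quoted in the cover letter. */
#define TEMPS_PER_MEM_OP   9
#define WORST_CASE_MEM_OPS 62

/* Temps needed if each plugin temp is allocated once and never re-used. */
static int temps_without_reuse(int mem_ops, int temps_per_op)
{
    assert(mem_ops >= 0 && temps_per_op >= 0);
    return mem_ops * temps_per_op;
}

/* True if the requested number of temps fits under the assumed limit. */
static bool fits_in_budget(int temps_needed)
{
    return temps_needed <= TCG_MAX_TEMPS_ASSUMED;
}
```

With the quoted figures, temps_without_reuse(WORST_CASE_MEM_OPS, TEMPS_PER_MEM_OP) gives 558, which fits_in_budget() rejects, so some re-usable kind of temp is needed.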

However I did add a new liveness pass which, with a single pass over
the opcode stream, can see that a TEMP_LOCAL is only live within a
single extended basic block, and thus may be transformed to TEMP_EBB.
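
As a rough illustration of the idea only, here is a toy model of such a pass. It is not the code from the series: the op and temp structures, the rule that every label starts a new extended basic block, and the name toy_liveness_pass_0 are all simplifying assumptions.

```c
#include <assert.h>
#include <stddef.h>

enum temp_kind { TEMP_EBB, TEMP_TB };
enum op_kind { OP_INSN, OP_LABEL };  /* in this model a label starts a new EBB */

#define MAX_ARGS  3
#define MAX_TEMPS 64
#define NO_TEMP   (-1)

struct op {
    enum op_kind kind;
    int args[MAX_ARGS];              /* temp indices, NO_TEMP for unused slots */
};

/*
 * One forward pass over the ops: remember the first EBB in which each temp
 * is referenced, and flag temps that are referenced in a second EBB.
 * Afterwards, demote every TEMP_TB temp that was not flagged to TEMP_EBB.
 */
static void toy_liveness_pass_0(const struct op *ops, size_t nb_ops,
                                enum temp_kind *kinds, size_t nb_temps)
{
    enum { UNSEEN = -1, MULTI = -2 };
    int seen_in[MAX_TEMPS];
    int ebb = 0;

    assert(nb_temps <= MAX_TEMPS);
    for (size_t t = 0; t < nb_temps; t++) {
        seen_in[t] = UNSEEN;
    }

    for (size_t i = 0; i < nb_ops; i++) {
        if (ops[i].kind == OP_LABEL) {
            ebb++;                   /* later references belong to the next EBB */
            continue;
        }
        for (int a = 0; a < MAX_ARGS; a++) {
            int t = ops[i].args[a];
            if (t == NO_TEMP) {
                continue;
            }
            assert(t >= 0 && (size_t)t < nb_temps);
            if (seen_in[t] == UNSEEN) {
                seen_in[t] = ebb;
            } else if (seen_in[t] != ebb) {
                seen_in[t] = MULTI;  /* referenced in more than one EBB */
            }
        }
    }

    for (size_t t = 0; t < nb_temps; t++) {
        if (kinds[t] == TEMP_TB && seen_in[t] != MULTI) {
            kinds[t] = TEMP_EBB;
        }
    }
}
```

In this model a temp referenced on both sides of a label keeps its TEMP_TB kind, while temps confined to one block are demoted. The real pass has to be more careful about what ends an extended basic block.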

With this, and by not recycling TEMP_LOCAL, we can get identical code
out of the backend even when the front end translators are adjusted
to use TEMP_LOCAL for everything.

Benchmarking one test case, qemu-arm linux-test, the new liveness pass
comes in at about 1.6% on perf, but I can't see any difference in
wall clock time before and after the patch set.


r~


Richard Henderson (27):
  tcg: Adjust TCGContext.temps_in_use check
  accel/tcg: Pass max_insn to gen_intermediate_code by pointer
  accel/tcg: Use more accurate max_insns for tb_overflow
  tcg: Remove branch-to-next regardless of reference count
  tcg: Rename TEMP_LOCAL to TEMP_TB
  tcg: Add liveness_pass_0
  tcg: Remove TEMP_NORMAL
  tcg: Pass TCGTempKind to tcg_temp_new_internal
  tcg: Add tcg_temp_ebb_new_{i32,i64,ptr}
  tcg: Add tcg_gen_movi_ptr
  tcg: Use tcg_temp_ebb_new_* in tcg/
  accel/tcg/plugin: Use tcg_temp_ebb_*
  accel/tcg/plugin: Tidy plugin_gen_disable_mem_helpers
  tcg: Don't re-use TEMP_TB temporaries
  tcg: Change default temp lifetime to TEMP_TB
  target/arm: Drop copies in gen_sve_{ldr,str}
  target/arm: Don't use tcg_temp_local_new_*
  target/cris: Don't use tcg_temp_local_new
  target/hexagon: Don't use tcg_temp_local_new_*
  target/hppa: Don't use tcg_temp_local_new
  target/i386: Don't use tcg_temp_local_new
  target/mips: Don't use tcg_temp_local_new
  target/ppc: Don't use tcg_temp_local_new
  target/xtensa: Don't use tcg_temp_local_new_*
  exec/gen-icount: Don't use tcg_temp_local_new_i32
  tcg: Remove tcg_temp_local_new_*, tcg_const_local_*
  tcg: Update docs/devel/tcg-ops.rst for temporary changes

 docs/devel/tcg-ops.rst                      | 103 ++++----
 target/hexagon/idef-parser/README.rst       |   4 +-
 include/exec/gen-icount.h                   |   8 +-
 include/exec/translator.h                   |   4 +-
 include/tcg/tcg-op.h                        |   7 +-
 include/tcg/tcg.h                           |  64 ++---
 target/arm/translate-a64.h                  |   1 -
 target/hexagon/gen_tcg.h                    |   4 +-
 accel/tcg/plugin-gen.c                      |  33 +--
 accel/tcg/translate-all.c                   |   2 +-
 accel/tcg/translator.c                      |   6 +-
 target/alpha/translate.c                    |   2 +-
 target/arm/translate-a64.c                  |   6 -
 target/arm/translate-sve.c                  |  38 +--
 target/arm/translate.c                      |   8 +-
 target/avr/translate.c                      |   2 +-
 target/cris/translate.c                     |   8 +-
 target/hexagon/genptr.c                     |  16 +-
 target/hexagon/idef-parser/parser-helpers.c |   4 +-
 target/hexagon/translate.c                  |   4 +-
 target/hppa/translate.c                     |   5 +-
 target/i386/tcg/translate.c                 |  29 +--
 target/loongarch/translate.c                |   2 +-
 target/m68k/translate.c                     |   2 +-
 target/microblaze/translate.c               |   2 +-
 target/mips/tcg/translate.c                 |  59 ++---
 target/nios2/translate.c                    |   2 +-
 target/openrisc/translate.c                 |   2 +-
 target/ppc/translate.c                      |   8 +-
 target/riscv/translate.c                    |   2 +-
 target/rx/translate.c                       |   2 +-
 target/s390x/tcg/translate.c                |   2 +-
 target/sh4/translate.c                      |   2 +-
 target/sparc/translate.c                    |   2 +-
 target/tricore/translate.c                  |   2 +-
 target/xtensa/translate.c                   |  18 +-
 tcg/optimize.c                              |   2 +-
 tcg/tcg-op-gvec.c                           | 270 ++++++++++----------
 tcg/tcg-op.c                                | 258 +++++++++----------
 tcg/tcg.c                                   | 270 +++++++++++---------
 target/cris/translate_v10.c.inc             |  10 +-
 target/mips/tcg/nanomips_translate.c.inc    |   4 +-
 target/ppc/translate/spe-impl.c.inc         |   8 +-
 target/ppc/translate/vmx-impl.c.inc         |   4 +-
 target/hexagon/README                       |   8 +-
 target/hexagon/gen_tcg_funcs.py             |  18 +-
 46 files changed, 640 insertions(+), 677 deletions(-)

-- 
2.34.1
Re: [PATCH 00/27] tcg: Simplify temporary usage
Posted by Richard Henderson 1 year, 2 months ago
Ping for the 9 patches lacking review.

r~

Re: [PATCH 00/27] tcg: Simplify temporary usage
Posted by Emilio Cota 1 year, 2 months ago
Hi Richard,

On Mon, Jan 30, 2023 at 10:59:07 -1000, Richard Henderson wrote:
(snip)
> With this, and by not recycling TEMP_LOCAL, we can get identical code
> out of the backend even when the front end translators are adjusted
> to use TEMP_LOCAL for everything.
> 
> Benchmarking one test case, qemu-arm linux-test, the new liveness pass
> comes in at about 1.6% on perf, but I can't see any difference in
> wall clock time before and after the patch set.

Yesterday I ran linux-user SPEC06 benchmarks from your tcg-life branch.
I do see perf regressions for two workloads (sjeng and xalancbmk).
With perf(1) I see liveness_pass* are at 0.00%, so I wonder: is it
possible that the emitted code isn't quite the same?

Happy to run more tests if helpful. Results below.

Thanks,
		Emilio

- bar chart, png: https://postimg.cc/ZCTkbYS9
- bar chart, txt:

                                                       Speedup of tcg-life (de6361f6) over master (ae2b5d83)
                                                           Host: AMD Ryzen 7 PRO 5850U. Compiler: gcc12
  1.03 +----------------------------------------------------------------------------------------------------------------------------------------------------+
  1.02 |-+.............................................................................................|..................................................+-|
       |                                                                                               |                                                    |
  1.01 |-+.............................................................................................|..................................................+-|
     1 |-+.....**+-+*.....*+-+**......+-+...........................**+-+**..............**+-+**....***|**...............**+-+**..........................+-|
       |       *    *     *+-+ *    **+-+**      +-+       +-+      * +-+ *              *     *    *  | *               * +-+ *                +-+         |
  0.99 |-+.....*....*.....*....*....*.....*....***|**.....**|***....*.....*..............*.....*....*..|.*......+-+......*.....*...............*+-+**.....+-|
  0.98 |-+.....*....*.....*....*....*.....*....*.+-+*.....*+-+.*....*.....*....**+-+*....*.....*....*..|.*.......|.......*.....*...............*....*.....+-|
       |       *    *     *    *    *     *    *    *     *    *    *     *    * +-+*    *     *    * +-+*     **|***    *     *               *    *       |
  0.97 |-+.....*....*.....*....*....*.....*....*....*.....*....*....*.....*....*....*....*.....*....*....*.....*.|..*....*.....*......+-+......*....*.....+-|
  0.96 |-+.....*....*.....*....*....*.....*....*....*.....*....*....*.....*....*....*....*.....*....*....*.....*+-+.*....*.....*.......|.......*....*.....+-|
       |       *    *     *    *    *     *    *    *     *    *    *     *    *    *    *     *    *    *     *    *    *     *    ***|**     *    *       |
  0.95 |-+.....*....*.....*....*....*.....*....*....*.....*....*....*.....*....*....*....*.....*....*....*.....*....*....*.....*....*.+-+*.....*....*.....+-|
  0.94 +----------------------------------------------------------------------------------------------------------------------------------------------------+
     400.perlbench 401.bzip2 403.gcc 429.mcf 445.gobmk 456.hmmer 458.sjeng 462.libquantum 464.h264ref 471.omnetpp 473.astar 483.xalancbmk geomean

- Raw data for the bar chart:

  + baseline:
# benchmark	mean	stdev	raw
400.perlbench	94.4343747333333	0.331828752549838	94.131272,94.421923,94.34074,94.747239,94.982504,94.602928,93.743109,94.077325,94.220688,94.505739,94.598781,94.779386,94.177626,94.811701,94.37466
401.bzip2	83.0563643333333	0.270338451882521	83.603378,82.784967,82.766427,83.703505,83.018864,82.859924,83.128875,83.052816,82.921046,82.809962,83.027326,83.122502,83.099782,83.005817,82.940274
403.gcc	2.8751204	0.0183794528241263	2.872445,2.886974,2.884226,2.877824,2.871482,2.927202,2.864385,2.86503,2.855154,2.856129,2.86079,2.861818,2.887109,2.867046,2.889192
429.mcf	13.527965	0.0849965442919382	13.498908,13.494126,13.469952,13.606229,13.604864,13.513806,13.472737,13.572454,13.407602,13.70441,13.487249,13.562176,13.503575,13.39053,13.630857
445.gobmk	279.017610333333	1.91925368167126	279.808944,278.057813,278.831984,279.388752,276.801944,280.078062,278.675088,277.094009,279.452037,278.832294,278.843473,279.407613,275.879438,284.430909,279.681795
456.hmmer	103.296133533333	0.38166706019324	103.33233,102.944119,103.083766,103.001765,104.302275,103.329573,103.720265,103.537909,102.931565,103.008669,102.974703,103.5448,103.484963,102.958228,103.287073
458.sjeng	332.387649666667	0.868297133920158	331.71233,333.413204,333.367836,332.57489,331.818019,331.14369,333.848697,333.135605,332.878587,332.069454,332.003468,332.692292,331.01894,331.426129,332.711604
462.libquantum	4.12260253333333	0.00508688019554322	4.121422,4.116031,4.131564,4.113532,4.117144,4.124039,4.128896,4.118079,4.121929,4.124027,4.125302,4.124549,4.119102,4.125368,4.128054
464.h264ref	244.092639533333	13.3464074285764	239.569243,240.187437,240.760271,241.483515,240.772044,241.492141,240.530232,240.449723,240.679955,240.464527,241.3703,292.302111,240.254072,240.490477,240.583545
471.omnetpp	261.340260533333	3.7694119109844	263.463533,259.640839,260.834291,263.816131,256.877675,259.833289,258.708458,261.868763,260.75424,265.656161,257.900388,265.734187,256.747515,270.004887,258.263551
473.astar	142.966170866667	0.481395129184935	142.636087,142.675786,141.895549,143.236359,142.892086,142.325069,143.267024,143.910479,143.279771,142.666683,143.11241,143.15343,143.041394,143.391831,143.008605
483.xalancbmk	401.605619866667	3.99007996364547	401.101824,400.266261,396.474675,406.136427,404.400767,406.339383,397.442574,409.241015,399.084079,399.828507,402.585078,394.89061,404.722299,401.654323,399.916476

  + tcg-life:
# benchmark	mean	stdev	raw
400.perlbench	94.1968828666667	0.352661861692484	94.726037,94.169276,93.893696,94.224617,94.613626,94.471446,94.198829,94.616742,93.845426,93.435601,94.040449,94.574709,94.105065,94.007179,94.030545
401.bzip2	83.0027554666667	0.214192109333076	83.181646,83.299212,83.342217,82.848151,82.808142,82.888099,82.942223,82.777883,82.739787,82.770313,83.01728,83.327844,83.201232,82.905666,82.991637
403.gcc	2.87870153333333	0.0304401106926527	2.860922,2.867219,2.860457,2.888637,2.879031,2.87397,2.882131,2.896422,2.865079,2.870739,2.847357,2.864518,2.901592,2.849287,2.973162
429.mcf	13.6952006666667	0.155876459519191	13.734646,13.746608,13.528171,13.577692,13.534005,13.65201,13.947822,13.541465,13.710553,13.787918,13.521862,13.997184,13.546621,13.848357,13.753096
445.gobmk	282.1855452	1.68500895181812	281.715494,282.875207,282.073035,281.660872,281.96679,278.912804,281.078281,283.777396,283.485664,278.564193,283.900278,283.662609,282.781748,284.176339,282.152468
456.hmmer	103.3804904	0.554303069916862	103.077106,103.013059,103.247046,105.192431,103.221722,102.99502,103.787524,103.086281,103.213953,103.048905,103.042041,103.664296,103.278652,103.445109,103.394211
458.sjeng	339.3596132	3.77963378278808	341.545293,341.249426,336.87165,343.192545,338.087093,339.691087,337.29754,341.586473,336.838538,345.476397,339.196873,342.773593,337.546389,329.312139,339.729162
462.libquantum	4.1225128	0.00546800475754836	4.112292,4.119043,4.119803,4.129127,4.117612,4.122837,4.120172,4.121449,4.127452,4.113505,4.129305,4.128303,4.126079,4.127113,4.1236
464.h264ref	243.447219066667	0.924288945630674	241.71547,242.724405,242.751474,243.730945,243.889673,243.254516,244.328523,244.374465,243.447008,245.45696,243.256098,242.348791,243.440131,242.895642,244.094185
471.omnetpp	268.2971082	5.67916415832786	271.509491,273.656661,274.294363,266.501929,272.7864,267.868119,271.032049,267.085038,256.124737,270.430985,271.586944,256.427087,268.23723,264.012334,272.903256
473.astar	142.842279266667	0.482819143874435	142.820726,142.742386,143.237814,143.241978,142.761549,142.026643,143.042933,142.849644,143.035134,142.150158,142.066603,143.086841,143.701693,142.553374,143.316713
483.xalancbmk	420.324755133333	8.22679014442942	424.925688,433.128404,415.710656,423.156208,428.067657,426.100068,429.6215,412.083569,411.921022,410.749722,407.134107,414.478705,416.110115,430.104758,421.579148
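
For reference, the speedups plotted in the chart can be recomputed from the mean columns above as baseline time divided by tcg-life time, so a value below 1 is a slowdown. This is a sketch under that assumption. The struct, the function names, and the bisection-based root (used only to avoid a libm dependency) are illustrative and are not taken from the benchmark scripts.

```c
#include <assert.h>
#include <stddef.h>

struct result {
    const char *name;
    double baseline_s;   /* mean run time on master, in seconds */
    double tcg_life_s;   /* mean run time on tcg-life, in seconds */
};

/* Means copied from the raw data tables in this message. */
static const struct result spec06[] = {
    { "400.perlbench",  94.4343747333333,  94.1968828666667 },
    { "401.bzip2",      83.0563643333333,  83.0027554666667 },
    { "403.gcc",         2.8751204,         2.87870153333333 },
    { "429.mcf",        13.527965,         13.6952006666667 },
    { "445.gobmk",     279.017610333333,  282.1855452 },
    { "456.hmmer",     103.296133533333,  103.3804904 },
    { "458.sjeng",     332.387649666667,  339.3596132 },
    { "462.libquantum",  4.12260253333333,  4.1225128 },
    { "464.h264ref",   244.092639533333,  243.447219066667 },
    { "471.omnetpp",   261.340260533333,  268.2971082 },
    { "473.astar",     142.966170866667,  142.842279266667 },
    { "483.xalancbmk", 401.605619866667,  420.324755133333 },
};
#define NB_RESULTS (sizeof(spec06) / sizeof(spec06[0]))

/* Speedup of tcg-life over baseline; below 1.0 means tcg-life is slower. */
static double speedup(const struct result *r)
{
    assert(r->tcg_life_s > 0.0);
    return r->baseline_s / r->tcg_life_s;
}

/* n-th root of x by bisection, so that the sketch does not need libm. */
static double nth_root(double x, unsigned n)
{
    double lo = 0.0, hi = x > 1.0 ? x : 1.0;

    assert(x >= 0.0 && n > 0);
    for (int i = 0; i < 100; i++) {
        double mid = (lo + hi) / 2.0, p = 1.0;
        for (unsigned k = 0; k < n; k++) {
            p *= mid;
        }
        if (p < x) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return (lo + hi) / 2.0;
}

/* Geometric mean of the per-benchmark speedups. */
static double geomean_speedup(void)
{
    double prod = 1.0;

    for (size_t i = 0; i < NB_RESULTS; i++) {
        prod *= speedup(&spec06[i]);
    }
    return nth_root(prod, (unsigned)NB_RESULTS);
}
```

Calling speedup() on the 458.sjeng and 483.xalancbmk entries gives the two lowest bars in the chart, and geomean_speedup() gives the last one.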

I then ran perf record on xalancbmk before/after:

$ time for suffix in gcc12; do for tag in tcg-life-baseline tcg-life; do perf record -o /tmp/$tag-$suffix.perf.data -k 1 taskset -c 2 ./spec06.pl --iterations=1 --size=train --config=aarch64 --show-raw run ~/src/dbt-bench/out/$tag-$suffix/bin/qemu-aarch64 ~/src/spec/spec06-aarch64 xalancbmk; done; done
483.xalancbmk (#1/1)
run_base_train_aarch64.0068.qemu-aarch64: qemu-aarch64 Xalan_base.aarch64   -v allbooks.xml xalanc.xsl:         410.191153s

# benchmark     mean    stdev   raw
483.xalancbmk   410.191153      0       410.191153
[ perf record: Woken up 251 times to write data ]
[ perf record: Captured and wrote 62.629 MB /tmp/tcg-life-baseline-gcc12.perf.data (1641030 samples) ]
483.xalancbmk (#1/1)
run_base_train_aarch64.0069.qemu-aarch64: qemu-aarch64 Xalan_base.aarch64   -v allbooks.xml xalanc.xsl:         464.428108s

# benchmark     mean    stdev   raw
483.xalancbmk   464.428108      0       464.428108
[ perf record: Woken up 284 times to write data ]
[ perf record: Captured and wrote 70.905 MB /tmp/tcg-life-gcc12.perf.data (1857959 samples) ]

real    14m35.863s
user    14m34.897s
sys     0m0.925s

- perf report (baseline):
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 1797955092780
#
# Overhead  Command       Shared Object            Symbol                                      
# ........  ............  .......................  ............................................
#
    43.83%  qemu-aarch64  qemu-aarch64             [.] helper_lookup_tb_ptr
     5.56%  qemu-aarch64  qemu-aarch64             [.] cpu_get_tb_cpu_state
     2.23%  qemu-aarch64  qemu-aarch64             [.] qht_lookup_custom
     1.57%  qemu-aarch64  qemu-aarch64             [.] tb_htable_lookup
     1.29%  qemu-aarch64  qemu-aarch64             [.] tb_lookup_cmp
     0.72%  qemu-aarch64  qemu-aarch64             [.] interval_tree_iter_first
     0.28%  qemu-aarch64  qemu-aarch64             [.] helper_vfp_cmpd_a64
     0.27%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f79244b2a43
     0.24%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f792449c058
     0.20%  qemu-aarch64  qemu-aarch64             [.] page_get_flags
     0.20%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f7924349c22
     0.19%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f7924349c40
     0.18%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f7924349203
     0.17%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f7923e09b03
     0.17%  qemu-aarch64  qemu-aarch64             [.] helper_vfp_cmped_a64
     0.15%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f7923e9f965
     0.15%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f7924552f2b
     0.15%  qemu-aarch64  qemu-aarch64             [.] float64_hs_compare
     0.14%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f79244f7003
     0.14%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f7924552a03
     0.14%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f7924349243
     0.14%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f7924546df6
     0.13%  qemu-aarch64  qemu-aarch64             [.] get_page_addr_code_hostp
     0.12%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f792454de7b
     0.12%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f7924555a85
     0.12%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f792454f465
     0.12%  qemu-aarch64  qemu-aarch64             [.] float64_add
     0.12%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f792439af03
     0.11%  qemu-aarch64  [JIT] tid 561758         [.] 0x00007f7924554b43
     [...]
     0.00%  qemu-aarch64  qemu-aarch64             [.] liveness_pass_1

- perf report (tcg-life):
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 2035140825489
#
# Overhead  Command       Shared Object            Symbol                                      
# ........  ............  .......................  ............................................
#
    43.00%  qemu-aarch64  qemu-aarch64             [.] helper_lookup_tb_ptr
     5.73%  qemu-aarch64  qemu-aarch64             [.] cpu_get_tb_cpu_state
     2.16%  qemu-aarch64  qemu-aarch64             [.] qht_lookup_custom
     1.58%  qemu-aarch64  qemu-aarch64             [.] tb_htable_lookup
     1.10%  qemu-aarch64  qemu-aarch64             [.] tb_lookup_cmp
     0.40%  qemu-aarch64  qemu-aarch64             [.] interval_tree_iter_first
     0.26%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb37d4018
     0.25%  qemu-aarch64  qemu-aarch64             [.] helper_vfp_cmpd_a64
     0.22%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb387ecb6
     0.21%  qemu-aarch64  qemu-aarch64             [.] page_get_flags
     0.19%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb3141b03
     0.17%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb3681d62
     0.16%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb388ae2b
     0.16%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb37ea9c3
     0.16%  qemu-aarch64  qemu-aarch64             [.] helper_vfp_cmped_a64
     0.15%  qemu-aarch64  qemu-aarch64             [.] get_page_addr_code_hostp
     0.15%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb3887325
     0.15%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb389ddc3
     0.14%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb3681d80
     0.14%  qemu-aarch64  qemu-aarch64             [.] float64_hs_compare
     0.13%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb36d2f83
     0.12%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb3885b65
     0.12%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb37eab43
     0.12%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb388a903
     0.12%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb31d7925
     0.11%  qemu-aarch64  qemu-aarch64             [.] parts64_float_to_sint
     0.11%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb3885d3b
     0.11%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb3681383
     0.11%  qemu-aarch64  [JIT] tid 562312         [.] 0x00007fdeb388a683
     [...]
     0.00%  qemu-aarch64  qemu-aarch64             [.] liveness_pass_1
     0.00%  qemu-aarch64  qemu-aarch64             [.] liveness_pass_0
Re: [PATCH 00/27] tcg: Simplify temporary usage
Posted by Richard Henderson 1 year, 2 months ago
On 2/10/23 02:35, Emilio Cota wrote:
> Yesterday I ran linux-user SPEC06 benchmarks from your tcg-life branch.
> I do see perf regressions for two workloads (sjeng and xalancbmk).
> With perf(1) I see liveness_pass* are at 0.00%, so I wonder: is it
> possible that the emitted code isn't quite the same?

Everything that I checked by hand was the same, but it's possible.
It's a tedious process.  You'd definitely want to turn off ASLR.

My current branch has __attribute__((noinline)) added to all of the liveness
passes, so that they don't get folded into tcg_gen_code.  But I still would
expect 0%.
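
For anyone unfamiliar with the attribute: GCC and Clang keep a function marked noinline as a separate symbol, so a profiler can attribute samples to it instead of to its caller. A minimal sketch with hypothetical stand-in functions (toy_pass and toy_gen_code are not QEMU code):

```c
#include <assert.h>

/*
 * Hypothetical stand-in for a liveness pass.  Without noinline, a static
 * function with a single caller is a likely candidate for inlining, and
 * its cycles would then be reported under the caller's symbol.
 */
static __attribute__((noinline)) int toy_pass(const int *ops, int n)
{
    int live = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        live += (ops[i] != 0);   /* count the non-zero entries */
    }
    return live;
}

/* Hypothetical stand-in for the caller (tcg_gen_code in the real tree). */
static int toy_gen_code(const int *ops, int n)
{
    return toy_pass(ops, n);
}
```

The attribute changes only where the code is emitted and how samples are attributed, not what the function computes.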

r~
Re: [PATCH 00/27] tcg: Simplify temporary usage
Posted by Emilio Cota 1 year, 1 month ago
On Wed, Feb 15, 2023 at 20:15:37 -1000, Richard Henderson wrote:
> On 2/10/23 02:35, Emilio Cota wrote:
> > Yesterday I ran linux-user SPEC06 benchmarks from your tcg-life branch.
> > I do see perf regressions for two workloads (sjeng and xalancbmk).
> > With perf(1) I see liveness_pass* are at 0.00%, so I wonder: is it
> > possible that the emitted code isn't quite the same?
> 
> Everything that I checked by hand was the same, but it's possible.
> It's a tedious process.  You'd definitely want to turn off ASLR.

I've checked with -jitdump and perf whether there was any difference
in the generated code before vs. after for the most common TBs.
They were identical.

Benchmarking without ASLR didn't make a difference, unfortunately.

> My current branch has __attribute__((noinline)) added to all of the liveness
> passes, so that they don't get folded into tcg_gen_code.  But I still would
> expect 0%.

I'll bisect the series in the next few days to see exactly where
the perf regression begins, so that at least we know where to look.

Thanks,
		Emilio