util: optimise flush_idcache_range when the ppc host has coherent icache

[PATCH] util: optimise flush_idcache_range when the ppc host has coherent icache

Posted by Nicholas Piggin 3 years, 8 months ago

dcache writeback and icache invalidate is not required when icache is
coherent, a shorter fixed-length sequence can be used which just has to
flush and re-fetch instructions that were in-flight.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---

I haven't been able to measure a significant performance difference
with this, qemu isn't flushing large ranges frequently so the old sequence
is not that slow.

 include/qemu/cacheflush.h |  4 ++++
 util/cacheflush.c         |  9 +++++++++
 util/cacheinfo.c          | 16 ++++++++++++++--
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/qemu/cacheflush.h b/include/qemu/cacheflush.h
index ae20bcda73..f65349ce3c 100644
--- a/include/qemu/cacheflush.h
+++ b/include/qemu/cacheflush.h
@@ -28,6 +28,10 @@ static inline void flush_idcache_range(uintptr_t rx, uintptr_t rw, size_t len)
 
 #else
 
+#if defined(__powerpc__)
+extern bool have_coherent_icache;
+#endif
+
 void flush_idcache_range(uintptr_t rx, uintptr_t rw, size_t len);
 
 #endif
diff --git a/util/cacheflush.c b/util/cacheflush.c
index 4b57186d89..15060f78b8 100644
--- a/util/cacheflush.c
+++ b/util/cacheflush.c
@@ -108,7 +108,16 @@ void flush_idcache_range(uintptr_t rx, uintptr_t rw, size_t len)
     size_t isize = qemu_icache_linesize;
 
     b = rw & ~(dsize - 1);
+
+    if (have_coherent_icache) {
+        asm volatile ("sync" : : : "memory");
+        asm volatile ("icbi 0,%0" : : "r"(b) : "memory");
+        asm volatile ("isync" : : : "memory");
+        return;
+    }
+
     e = (rw + len + dsize - 1) & ~(dsize - 1);
+
     for (p = b; p < e; p += dsize) {
         asm volatile ("dcbst 0,%0" : : "r"(p) : "memory");
     }
diff --git a/util/cacheinfo.c b/util/cacheinfo.c
index ab1644d490..b632ff47ae 100644
--- a/util/cacheinfo.c
+++ b/util/cacheinfo.c
@@ -133,18 +133,30 @@ static void arch_cache_info(int *isize, int *dsize)
     }
 }
 
-#elif defined(_ARCH_PPC) && defined(__linux__)
-# include "elf.h"
+#elif defined(__powerpc__)
 
+bool have_coherent_icache = false;
+
+# if defined(_ARCH_PPC) && defined(__linux__)
+#  include "elf.h"
 static void arch_cache_info(int *isize, int *dsize)
 {
+#  ifdef PPC_FEATURE_ICACHE_SNOOP
+    unsigned long hwcap = qemu_getauxval(AT_HWCAP);
+#  endif
+
     if (*isize == 0) {
         *isize = qemu_getauxval(AT_ICACHEBSIZE);
     }
     if (*dsize == 0) {
         *dsize = qemu_getauxval(AT_DCACHEBSIZE);
     }
+
+#  ifdef PPC_FEATURE_ICACHE_SNOOP
+    have_coherent_icache = (hwcap & PPC_FEATURE_ICACHE_SNOOP) != 0;
+#  endif
 }
+# endif
 
 #else
 static void arch_cache_info(int *isize, int *dsize) { }
-- 
2.35.1

Re: [PATCH] util: optimise flush_idcache_range when the ppc host has coherent icache

Posted by Richard Henderson 3 years, 8 months ago

On 5/19/22 07:11, Nicholas Piggin wrote:
> dcache writeback and icache invalidate is not required when icache is
> coherent, a shorter fixed-length sequence can be used which just has to
> flush and re-fetch instructions that were in-flight.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> 
> I haven't been able to measure a significant performance difference
> with this, qemu isn't flushing large ranges frequently so the old sequence
> is not that slow.

Yeah, we should be flushing smallish regions (< 1-4k), as we generate TranslationBlocks. 
And hopefully the translation cache is large enough that we spend more time executing 
blocks than re-compiling them.  ;-)

> +++ b/include/qemu/cacheflush.h
> @@ -28,6 +28,10 @@ static inline void flush_idcache_range(uintptr_t rx, uintptr_t rw, size_t len)
>   
>   #else
>   
> +#if defined(__powerpc__)
> +extern bool have_coherent_icache;
> +#endif

Ug.  I'm undecided where to put this.  I'm tempted to say...

> --- a/util/cacheflush.c
> +++ b/util/cacheflush.c
> @@ -108,7 +108,16 @@ void flush_idcache_range(uintptr_t rx, uintptr_t rw, size_t len)

... here in cacheflush.c, with a comment that the variable is defined and initialized in 
cacheinfo.c.

I'm even more tempted to merge the two files to put all of the machine-specific cache data 
in the same place, then this variable can be static.  There's even an existing TODO 
comment in cacheflush.c for aarch64.

>       b = rw & ~(dsize - 1);
> +
> +    if (have_coherent_icache) {
> +        asm volatile ("sync" : : : "memory");
> +        asm volatile ("icbi 0,%0" : : "r"(b) : "memory");
> +        asm volatile ("isync" : : : "memory");
> +        return;
> +    }

Where can I find definitive rules on this?

Note that rx may not equal rw, and that we've got two virtual mappings for the same 
memory, one for "data" that is read-write and one for "execute" that is read-execute. 
(This split is enabled only for --enable-debug-tcg builds on linux, to make sure we don't 
regress apple m1, which requires the split all of the time.)

In particular, you're flushing one icache line with the dcache address, and that you're 
not flushing any of the other lines.  Is the coherent icache thing really that we may 
simply skip the dcache flush step, but must still flush all of the icache lines?

Without docs, "icache snoop" to me would imply that we only need the two barriers and no 
flushes at all, just to make sure all memory writes complete before any new instructions 
are executed.  This would be like the two AArch64 bits, IDC and DIC, which indicate that 
the two caches are coherent to Point of Unification, which leaves us with just the 
Instruction Sequence Barrier at the end of the function.

> +bool have_coherent_icache = false;

scripts/checkpatch.pl should complain this is initialized to 0.

>   static void arch_cache_info(int *isize, int *dsize)
>   {
> +#  ifdef PPC_FEATURE_ICACHE_SNOOP
> +    unsigned long hwcap = qemu_getauxval(AT_HWCAP);
> +#  endif
> +
>       if (*isize == 0) {
>           *isize = qemu_getauxval(AT_ICACHEBSIZE);
>       }
>       if (*dsize == 0) {
>           *dsize = qemu_getauxval(AT_DCACHEBSIZE);
>       }
> +
> +#  ifdef PPC_FEATURE_ICACHE_SNOOP
> +    have_coherent_icache = (hwcap & PPC_FEATURE_ICACHE_SNOOP) != 0;
> +#  endif

Better with only one ifdef, moving this second hunk up.

It would be nice if there were some kernel documentation for this...

r~

Re: [PATCH] util: optimise flush_idcache_range when the ppc host has coherent icache

Posted by Nicholas Piggin 3 years, 8 months ago

Excerpts from Richard Henderson's message of May 20, 2022 4:31 am:
> On 5/19/22 07:11, Nicholas Piggin wrote:
>> dcache writeback and icache invalidate is not required when icache is
>> coherent, a shorter fixed-length sequence can be used which just has to
>> flush and re-fetch instructions that were in-flight.
>> 
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> 
>> I haven't been able to measure a significant performance difference
>> with this, qemu isn't flushing large ranges frequently so the old sequence
>> is not that slow.
> 
> Yeah, we should be flushing smallish regions (< 1-4k), as we generate TranslationBlocks. 
> And hopefully the translation cache is large enough that we spend more time executing 
> blocks than re-compiling them.  ;-)
> 
> 
>> +++ b/include/qemu/cacheflush.h
>> @@ -28,6 +28,10 @@ static inline void flush_idcache_range(uintptr_t rx, uintptr_t rw, size_t len)
>>   
>>   #else
>>   
>> +#if defined(__powerpc__)
>> +extern bool have_coherent_icache;
>> +#endif
> 
> Ug.  I'm undecided where to put this.  I'm tempted to say...
> 
>> --- a/util/cacheflush.c
>> +++ b/util/cacheflush.c
>> @@ -108,7 +108,16 @@ void flush_idcache_range(uintptr_t rx, uintptr_t rw, size_t len)
> 
> ... here in cacheflush.c, with a comment that the variable is defined and initialized in 
> cacheinfo.c.
> 
> I'm even more tempted to merge the two files to put all of the machine-specific cache data 
> in the same place, then this variable can be static.  There's even an existing TODO 
> comment in cacheflush.c for aarch64.

That might be nice. Do you want me to look at doing that first?

>>       b = rw & ~(dsize - 1);
>> +
>> +    if (have_coherent_icache) {
>> +        asm volatile ("sync" : : : "memory");
>> +        asm volatile ("icbi 0,%0" : : "r"(b) : "memory");
>> +        asm volatile ("isync" : : : "memory");
>> +        return;
>> +    }
> 
> Where can I find definitive rules on this?

In processor manuals (I don't know if there are any notes about this in 
the ISA, I would be tempted to say there should be since many processors
implement it).

POWER9 UM, 4.6.2.2 Instruction Cache Block Invalidate (icbi) 

https://ibm.ent.box.com/s/tmklq90ze7aj8f4n32er1mu3sy9u8k3k

> Note that rx may not equal rw, and that we've got two virtual mappings for the same 
> memory, one for "data" that is read-write and one for "execute" that is read-execute. 
> (This split is enabled only for --enable-debug-tcg builds on linux, to make sure we don't 
> regress apple m1, which requires the split all of the time.)
> 
> In particular, you're flushing one icache line with the dcache address, and that you're 
> not flushing any of the other lines.  Is the coherent icache thing really that we may 
> simply skip the dcache flush step, but must still flush all of the icache lines?

Yeah it's just a funny sequence the processor implements. It treats icbi 
almost as a no-op except that it sets a flag such that the next isync 
will flush and refetch the pipeline. It doesn't do any cache flushing.

> Without docs, "icache snoop" to me would imply that we only need the two barriers and no 
> flushes at all, just to make sure all memory writes complete before any new instructions 
> are executed.  This would be like the two AArch64 bits, IDC and DIC, which indicate that 
> the two caches are coherent to Point of Unification, which leaves us with just the 
> Instruction Sequence Barrier at the end of the function.
> 
> 
>> +bool have_coherent_icache = false;
> 
> scripts/checkpatch.pl should complain this is initialized to 0.
> 
> 
>>   static void arch_cache_info(int *isize, int *dsize)
>>   {
>> +#  ifdef PPC_FEATURE_ICACHE_SNOOP
>> +    unsigned long hwcap = qemu_getauxval(AT_HWCAP);
>> +#  endif
>> +
>>       if (*isize == 0) {
>>           *isize = qemu_getauxval(AT_ICACHEBSIZE);
>>       }
>>       if (*dsize == 0) {
>>           *dsize = qemu_getauxval(AT_DCACHEBSIZE);
>>       }
>> +
>> +#  ifdef PPC_FEATURE_ICACHE_SNOOP
>> +    have_coherent_icache = (hwcap & PPC_FEATURE_ICACHE_SNOOP) != 0;
>> +#  endif
> 
> Better with only one ifdef, moving this second hunk up.

Will clean those bits up, thanks.

> It would be nice if there were some kernel documentation for this...

arm64 has kernel docs for hwcaps... powerpc probably should as well.
Good point, I might do a patch for that.

Thanks,
Nick

Re: [PATCH] util: optimise flush_idcache_range when the ppc host has coherent icache

Posted by Richard Henderson 3 years, 8 months ago

On 5/19/22 17:04, Nicholas Piggin wrote:
>> I'm even more tempted to merge the two files to put all of the machine-specific cache data
>> in the same place, then this variable can be static.  There's even an existing TODO
>> comment in cacheflush.c for aarch64.
> 
> That might be nice. Do you want me to look at doing that first?

If you wouldn't mind, please do.  I'll take care of aarch64 cleanup related to the TODO 
afterward.

>>>        b = rw & ~(dsize - 1);
>>> +
>>> +    if (have_coherent_icache) {
>>> +        asm volatile ("sync" : : : "memory");
>>> +        asm volatile ("icbi 0,%0" : : "r"(b) : "memory");
>>> +        asm volatile ("isync" : : : "memory");
>>> +        return;
>>> +    }
>>
>> Where can I find definitive rules on this?
> 
> In processor manuals (I don't know if there are any notes about this in
> the ISA, I would be tempted to say there should be since many processors
> implement it).
> 
> POWER9 UM, 4.6.2.2 Instruction Cache Block Invalidate (icbi)
> 
> https://ibm.ent.box.com/s/tmklq90ze7aj8f4n32er1mu3sy9u8k3k
...
> Yeah it's just a funny sequence the processor implements. It treats icbi
> almost as a no-op except that it sets a flag such that the next isync
> will flush and refetch the pipeline. It doesn't do any cache flushing.

Thanks.  A short comment in the code would be helpful here.

Also, since the docs say "any address", you might as well just use rx unmodified and sink 
the computation of 'b' back next to 'e'.


r~