tests: Make hash table mocking arch independent

[libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by Peter Krempa 7 years, 11 months ago

It turns out that our implementation of the hashing function is
endian-dependent and thus if used on various architectures the testsuite
may have different results. Work this around by mocking virHashCodeGen
to something which does not use bit operations instead of just setting a
deterministic seed.
---
 .../qemumonitorjson-nodename-relative.result       | 24 +++++++++++-----------
 .../qemumonitorjson-nodename-same-backing.result   | 24 +++++++++++-----------
 tests/virdeterministichashmock.c                   | 17 +++++++++++----
 tests/virmacmaptestdata/simple2.json               | 12 +++++------
 4 files changed, 43 insertions(+), 34 deletions(-)

diff --git a/tests/qemumonitorjsondata/qemumonitorjson-nodename-relative.result b/tests/qemumonitorjsondata/qemumonitorjson-nodename-relative.result
index 6c0c77618..5288319d3 100644
--- a/tests/qemumonitorjsondata/qemumonitorjson-nodename-relative.result
+++ b/tests/qemumonitorjsondata/qemumonitorjson-nodename-relative.result
@@ -1,15 +1,3 @@
-drive-ide0-0-1
-filename    : '/var/lib/libvirt/images/relsnap.qcow2'
-format node : '#block1290'
-format drv  : 'qcow2'
-storage node: '#block1107'
-storage drv : 'file'
-  filename    : '/var/lib/libvirt/images/base.qcow2'
-  format node : '#block927'
-  format drv  : 'qcow2'
-  storage node: '#block800'
-  storage drv : 'file'
-
 drive-ide0-0-0
 filename    : '/var/lib/libvirt/images/img3'
 format node : '#block118'
@@ -31,3 +19,15 @@ storage drv : 'file'
       format drv  : 'qcow2'
       storage node: '#block614'
       storage drv : 'file'
+
+drive-ide0-0-1
+filename    : '/var/lib/libvirt/images/relsnap.qcow2'
+format node : '#block1290'
+format drv  : 'qcow2'
+storage node: '#block1107'
+storage drv : 'file'
+  filename    : '/var/lib/libvirt/images/base.qcow2'
+  format node : '#block927'
+  format drv  : 'qcow2'
+  storage node: '#block800'
+  storage drv : 'file'
diff --git a/tests/qemumonitorjsondata/qemumonitorjson-nodename-same-backing.result b/tests/qemumonitorjsondata/qemumonitorjson-nodename-same-backing.result
index 87431f7ca..7b12a1746 100644
--- a/tests/qemumonitorjsondata/qemumonitorjson-nodename-same-backing.result
+++ b/tests/qemumonitorjsondata/qemumonitorjson-nodename-same-backing.result
@@ -1,15 +1,3 @@
-drive-sata0-0-1
-filename    : '/var/lib/libvirt/images/b.qcow2'
-format node : '#block548'
-format drv  : 'qcow2'
-storage node: '#block487'
-storage drv : 'file'
-  filename    : '/var/lib/libvirt/images/base.qcow2'
-  format node : '#block771'
-  format drv  : 'qcow2'
-  storage node: '#block692'
-  storage drv : 'file'
-
 drive-sata0-0-0
 filename    : '/var/lib/libvirt/images/a.qcow2'
 format node : '#block132'
@@ -21,3 +9,15 @@ storage drv : 'file'
   format drv  : 'qcow2'
   storage node: '#block224'
   storage drv : 'file'
+
+drive-sata0-0-1
+filename    : '/var/lib/libvirt/images/b.qcow2'
+format node : '#block548'
+format drv  : 'qcow2'
+storage node: '#block487'
+storage drv : 'file'
+  filename    : '/var/lib/libvirt/images/base.qcow2'
+  format node : '#block771'
+  format drv  : 'qcow2'
+  storage node: '#block692'
+  storage drv : 'file'
diff --git a/tests/virdeterministichashmock.c b/tests/virdeterministichashmock.c
index d01f1c9e4..cd80cfcb5 100644
--- a/tests/virdeterministichashmock.c
+++ b/tests/virdeterministichashmock.c
@@ -20,10 +20,19 @@

 #include <config.h>

-#include "virrandom.h"
+#include "util/virhashcode.h"

-uint64_t virRandomBits(int nbits ATTRIBUTE_UNUSED)
+uint32_t
+virHashCodeGen(const void *key,
+               size_t len,
+               uint32_t seed ATTRIBUTE_UNUSED)
 {
-    return 4; /* chosen by fair dice roll.
-                 guaranteed to be random. */
+    const uint8_t *k = key;
+    uint32_t h = 0;
+    size_t i;
+
+    for (i = 0; i < len; i++)
+        h += k[i];
+
+    return h;
 }
diff --git a/tests/virmacmaptestdata/simple2.json b/tests/virmacmaptestdata/simple2.json
index 91b2cde0c..52132c241 100644
--- a/tests/virmacmaptestdata/simple2.json
+++ b/tests/virmacmaptestdata/simple2.json
@@ -1,16 +1,16 @@
 [
   {
-    "domain": "f25",
+    "domain": "f24",
     "macs": [
-      "00:11:22:33:44:55",
-      "aa:bb:cc:00:11:22"
+      "aa:bb:cc:dd:ee:ff",
+      "a1:b2:c3:d4:e5:f6"
     ]
   },
   {
-    "domain": "f24",
+    "domain": "f25",
     "macs": [
-      "aa:bb:cc:dd:ee:ff",
-      "a1:b2:c3:d4:e5:f6"
+      "00:11:22:33:44:55",
+      "aa:bb:cc:00:11:22"
     ]
   }
 ]
-- 
2.13.2

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by Bjoern Walk 7 years, 11 months ago

Peter Krempa <pkrempa@redhat.com> [2017-08-02, 05:39PM +0200]:
> It turns out that our implementation of the hashing function is
> endian-dependent and thus if used on various architectures the testsuite
> may have different results. Work this around by mocking virHashCodeGen
> to something which does not use bit operations instead of just setting a
> deterministic seed.

This does fix the issue on S390. Anyways, I'd also like to see the tests
fixed that rely on the ordering of the hash table. And code that uses
the hash table should be tested that it does NOT rely on the ordering.

Tested-by: Bjoern Walk <bwalk@linux.vnet.ibm.com>
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by Peter Krempa 7 years, 11 months ago

On Thu, Aug 03, 2017 at 07:24:35 +0200, Bjoern Walk wrote:
> Peter Krempa <pkrempa@redhat.com> [2017-08-02, 05:39PM +0200]:
> > It turns out that our implementation of the hashing function is
> > endian-dependent and thus if used on various architectures the testsuite
> > may have different results. Work this around by mocking virHashCodeGen
> > to something which does not use bit operations instead of just setting a
> > deterministic seed.
> 
> This does fix the issue on S390. Anyways, I'd also like to see the tests
> fixed that rely on the ordering of the hash table. And code that uses

The tests are fixed. They are ordered correctly to the newly mocked
function. I don't quite get what more you'd like to see fixed.

> the hash table should be tested that it does NOT rely on the ordering.

Well, that's a property of the hash table that the code should not
depend on ordering. In fact the only part which is slightly dependant on
ordering is the test suite. The fix to mock the ordering function
properly is tested.

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by Bjoern Walk 7 years, 11 months ago

Peter Krempa <pkrempa@redhat.com> [2017-08-03, 09:47AM +0200]:
> On Thu, Aug 03, 2017 at 07:24:35 +0200, Bjoern Walk wrote:
> > Peter Krempa <pkrempa@redhat.com> [2017-08-02, 05:39PM +0200]:
> > > It turns out that our implementation of the hashing function is
> > > endian-dependent and thus if used on various architectures the testsuite
> > > may have different results. Work this around by mocking virHashCodeGen
> > > to something which does not use bit operations instead of just setting a
> > > deterministic seed.
> > 
> > This does fix the issue on S390. Anyways, I'd also like to see the tests
> > fixed that rely on the ordering of the hash table. And code that uses
> 
> The tests are fixed. They are ordered correctly to the newly mocked
> function. I don't quite get what more you'd like to see fixed.
> 

No, the test is still dependent on the ordering. Just now the mocked
hash table has deterministic ordering. This is not the same. Although it
is improbable that this will bite us somewhere, I'd still prefer a clean
solution.

> > the hash table should be tested that it does NOT rely on the ordering.
> 
> Well, that's a property of the hash table that the code should not
> depend on ordering.

Yes, and I think it would be a good idea to test for this (no idea how
to do this though).

> In fact the only part which is slightly dependant on ordering is the
> test suite. The fix to mock the ordering function properly is tested.
> 

Like with all mocks, we now test a different version of code in the test
suite. We test the code under the assumption that the hash table is
ordered deterministically. However, the real code doesn't fulfill this
assumption (rightly so, because ordering is not a property of the hash
table). There's a discrepancy.

For example, let's assume we have code that writes out an XML file from
a hash table. We do this similar to the test code in
testBlockNodeNameDetect. Writing a test against this will never fail,
because we test against the mocked hash table. However, the actual code
produces results that is dependent on the ordering.

I know this is nitpicky, but if someone down the road runs into any
problems with this it is nasty do understand.

> --
> libvir-list mailing list
> libvir-list@redhat.com
> https://www.redhat.com/mailman/listinfo/libvir-list

-- 
IBM Systems
Linux on z Systems & Virtualization Development
------------------------------------------------------------------------
IBM Deutschland
Schönaicher Str. 220
71032 Böblingen
Phone: +49 7031 16 1819
E-Mail: bwalk@de.ibm.com
------------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294 
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by Peter Krempa 7 years, 11 months ago

On Thu, Aug 03, 2017 at 10:28:59 +0200, Bjoern Walk wrote:
> Peter Krempa <pkrempa@redhat.com> [2017-08-03, 09:47AM +0200]:
> > On Thu, Aug 03, 2017 at 07:24:35 +0200, Bjoern Walk wrote:
> > > Peter Krempa <pkrempa@redhat.com> [2017-08-02, 05:39PM +0200]:
> > > > It turns out that our implementation of the hashing function is
> > > > endian-dependent and thus if used on various architectures the testsuite
> > > > may have different results. Work this around by mocking virHashCodeGen
> > > > to something which does not use bit operations instead of just setting a
> > > > deterministic seed.
> > > 
> > > This does fix the issue on S390. Anyways, I'd also like to see the tests
> > > fixed that rely on the ordering of the hash table. And code that uses
> > 
> > The tests are fixed. They are ordered correctly to the newly mocked
> > function. I don't quite get what more you'd like to see fixed.
> > 
> 
> No, the test is still dependent on the ordering. Just now the mocked
> hash table has deterministic ordering. This is not the same. Although it
> is improbable that this will bite us somewhere, I'd still prefer a clean
> solution.

I don't think it's worth doing this in the tests. The hash table at
least in case of the node name detection code is just an intermediate
step to fill the data into the domain definition and it's a conveniet
place to test the detection code.

Testing it in other places would remove the dependency on deterministic
ordering of the hash table but would not really add much value. The cost
would be much more complexity in the test case which I don't think it's
worth.

If you feel bothered by it, please post patches. I think currently the
testsuite tests the complex portion of the code without the bloating
necessary for the so-called "clean" solution.
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by John Ferlan 7 years, 10 months ago

On 08/03/2017 04:50 AM, Peter Krempa wrote:
> On Thu, Aug 03, 2017 at 10:28:59 +0200, Bjoern Walk wrote:
>> Peter Krempa <pkrempa@redhat.com> [2017-08-03, 09:47AM +0200]:
>>> On Thu, Aug 03, 2017 at 07:24:35 +0200, Bjoern Walk wrote:
>>>> Peter Krempa <pkrempa@redhat.com> [2017-08-02, 05:39PM +0200]:
>>>>> It turns out that our implementation of the hashing function is
>>>>> endian-dependent and thus if used on various architectures the testsuite
>>>>> may have different results. Work this around by mocking virHashCodeGen
>>>>> to something which does not use bit operations instead of just setting a
>>>>> deterministic seed.
>>>>
>>>> This does fix the issue on S390. Anyways, I'd also like to see the tests
>>>> fixed that rely on the ordering of the hash table. And code that uses
>>>
>>> The tests are fixed. They are ordered correctly to the newly mocked
>>> function. I don't quite get what more you'd like to see fixed.
>>>
>>
>> No, the test is still dependent on the ordering. Just now the mocked
>> hash table has deterministic ordering. This is not the same. Although it
>> is improbable that this will bite us somewhere, I'd still prefer a clean
>> solution.
> 
> I don't think it's worth doing this in the tests. The hash table at
> least in case of the node name detection code is just an intermediate
> step to fill the data into the domain definition and it's a conveniet
> place to test the detection code.
> 
> Testing it in other places would remove the dependency on deterministic
> ordering of the hash table but would not really add much value. The cost
> would be much more complexity in the test case which I don't think it's
> worth.
> 
> If you feel bothered by it, please post patches. I think currently the
> testsuite tests the complex portion of the code without the bloating
> necessary for the so-called "clean" solution.
> 
> 

NB: I didn't dig into the qemumonitorjsontest algorithm, but...

While going through the common object changes, I ended up looking at and
thinking about the hash table algorithms, initially it was just the
"grow" algorithm as I think it is a bit "aggressive" (perhaps wasteful)
since as soon as 1 bucket chain exceeds 8 elements, the table is resized
by 8 unless the new size is 8 * 2048 - at which time no matter how full
a bucket is we don't resize the table.

The next thing I considered was using a prime number as the table bucket
size and it seems by just doing that will cause qemumonitorjsontest to
fail. Easy enough to reproduce by changing the virHashCreate call in
qemuBlockNodeNameGetBackingChain to use 53 instead of 50 (even 49 causes
failure). It doesn't matter if the test source also adjusts the initial
hash size.

Of course this makes sense given that virHashCodeGen is the backend of
the virHashComputeKey algorithm that uses the return value %
table->size. So if "size" changes, then obviously order changes.

While I get the code bloat conundrum, but if the order doesn't matter
then I wonder what other variables could change that would affect this
test and whether we really want to be constantly altering the mocks
"just" to get the right answer for this test.

John

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by Peter Krempa 7 years, 10 months ago

On Mon, Aug 14, 2017 at 20:46:10 -0400, John Ferlan wrote:
> On 08/03/2017 04:50 AM, Peter Krempa wrote:

[ trimmed off-topic part ]

> NB: I didn't dig into the qemumonitorjsontest algorithm, but...
> 
> While going through the common object changes, I ended up looking at and
> thinking about the hash table algorithms, initially it was just the
> "grow" algorithm as I think it is a bit "aggressive" (perhaps wasteful)
> since as soon as 1 bucket chain exceeds 8 elements, the table is resized
> by 8 unless the new size is 8 * 2048 - at which time no matter how full
> a bucket is we don't resize the table

Resizing the table is very expensive, so it makes sense to scale it
aggresively. At some point though it would take too much memory.

If some place in the code expects massive hash table, it can set the
hash table to > 2048 manually.

> The next thing I considered was using a prime number as the table bucket
> size and it seems by just doing that will cause qemumonitorjsontest to

What would be the benefit of such change? I don't really see how prime
numbers would improve anything performance-wise.

> fail. Easy enough to reproduce by changing the virHashCreate call in
> qemuBlockNodeNameGetBackingChain to use 53 instead of 50 (even 49 causes
> failure). It doesn't matter if the test source also adjusts the initial
> hash size.
> 
> Of course this makes sense given that virHashCodeGen is the backend of
> the virHashComputeKey algorithm that uses the return value %
> table->size. So if "size" changes, then obviously order changes.

As you've said that is expected. I don't quite follow what is your
point here.

> While I get the code bloat conundrum, but if the order doesn't matter
> then I wonder what other variables could change that would affect this
> test and whether we really want to be constantly altering the mocks
> "just" to get the right answer for this test.

The order depends on the result of the hasing function, the number of
buckets and the order you've added the entries. All of those is
expected when dealing with a hash table.

After this last patch, you are guaranteed to have the hash function
return deterministic results (even on big endian-boxes).

The hash size is not really altered very often since you can't really
see a measurable performance benefit in most cases. This is also
strenghtened by the fact, that the hash function is randomized by a
random number, so in real usage you won't be able to deterministically
cause hash collisions.

In the test code, the cases are added in a deterministic order. Since
the hashing function in the tests is deterministic as well, the only
thing that would influence the ordering is the bucket count. We don't
modify them too often.

I don't see what would make us alter mocks "just" to get the right
answer here. The reason for this patch was, that the hash function is
using bit operations and then treating the result as integers, which
obviously does not work deterministically between big and little endian
arches. Other than that, the test is now fully deterministic.

Peter
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by Michal Privoznik 7 years, 10 months ago

On 08/15/2017 02:01 PM, Peter Krempa wrote:
> On Mon, Aug 14, 2017 at 20:46:10 -0400, John Ferlan wrote:
>> On 08/03/2017 04:50 AM, Peter Krempa wrote:
> 
> [ trimmed off-topic part ]
> 
>> NB: I didn't dig into the qemumonitorjsontest algorithm, but...
>>
>> While going through the common object changes, I ended up looking at and
>> thinking about the hash table algorithms, initially it was just the
>> "grow" algorithm as I think it is a bit "aggressive" (perhaps wasteful)
>> since as soon as 1 bucket chain exceeds 8 elements, the table is resized
>> by 8 unless the new size is 8 * 2048 - at which time no matter how full
>> a bucket is we don't resize the table
> 
> Resizing the table is very expensive, so it makes sense to scale it
> aggresively. At some point though it would take too much memory.
> 
> If some place in the code expects massive hash table, it can set the
> hash table to > 2048 manually.
> 
>> The next thing I considered was using a prime number as the table bucket
>> size and it seems by just doing that will cause qemumonitorjsontest to
> 
> What would be the benefit of such change? I don't really see how prime
> numbers would improve anything performance-wise.

Because if your hash function is stupid it can cluster all the keys
together. Primes, however, have way fewer divisors, therefore clustering
is harder to achieve. Of course, you can construct a hash function so
that it's utterly stupid (e.g. {return 4;}) and then all of our
optimizations are thrown out of the window. But I believe that using
prime numbers for table size is advised in the literature too.
Apparently, wiki mentions this fact too:

https://en.wikipedia.org/wiki/Hash_table#Choosing_a_hash_function

Michal

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by Daniel P. Berrange 7 years, 10 months ago

On Tue, Aug 15, 2017 at 03:03:07PM +0200, Michal Privoznik wrote:
> On 08/15/2017 02:01 PM, Peter Krempa wrote:
> > On Mon, Aug 14, 2017 at 20:46:10 -0400, John Ferlan wrote:
> >> On 08/03/2017 04:50 AM, Peter Krempa wrote:
> > 
> > [ trimmed off-topic part ]
> > 
> >> NB: I didn't dig into the qemumonitorjsontest algorithm, but...
> >>
> >> While going through the common object changes, I ended up looking at and
> >> thinking about the hash table algorithms, initially it was just the
> >> "grow" algorithm as I think it is a bit "aggressive" (perhaps wasteful)
> >> since as soon as 1 bucket chain exceeds 8 elements, the table is resized
> >> by 8 unless the new size is 8 * 2048 - at which time no matter how full
> >> a bucket is we don't resize the table
> > 
> > Resizing the table is very expensive, so it makes sense to scale it
> > aggresively. At some point though it would take too much memory.
> > 
> > If some place in the code expects massive hash table, it can set the
> > hash table to > 2048 manually.
> > 
> >> The next thing I considered was using a prime number as the table bucket
> >> size and it seems by just doing that will cause qemumonitorjsontest to
> > 
> > What would be the benefit of such change? I don't really see how prime
> > numbers would improve anything performance-wise.
> 
> Because if your hash function is stupid it can cluster all the keys
> together. Primes, however, have way fewer divisors, therefore clustering
> is harder to achieve. Of course, you can construct a hash function so
> that it's utterly stupid (e.g. {return 4;}) and then all of our
> optimizations are thrown out of the window. But I believe that using
> prime numbers for table size is advised in the literature too.
> Apparently, wiki mentions this fact too:
> 
> https://en.wikipedia.org/wiki/Hash_table#Choosing_a_hash_function

If your hash function has a problem with clustering, then you really
should replace the hash function with a better one - which we already
did a few years back now :-)

Incidentally, we should replace our murmurhash funtion with siphash
at some point, because people have found attacks against murmurhash,
and so most devs have migrated over to siphash.

Regards,
Daniel

[1] https://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by Peter Krempa 7 years, 10 months ago

On Tue, Aug 15, 2017 at 15:03:07 +0200, Michal Privoznik wrote:
> On 08/15/2017 02:01 PM, Peter Krempa wrote:
> > On Mon, Aug 14, 2017 at 20:46:10 -0400, John Ferlan wrote:
> >> On 08/03/2017 04:50 AM, Peter Krempa wrote:

[...]

> >> The next thing I considered was using a prime number as the table bucket
> >> size and it seems by just doing that will cause qemumonitorjsontest to
> > 
> > What would be the benefit of such change? I don't really see how prime
> > numbers would improve anything performance-wise.
> 
> Because if your hash function is stupid it can cluster all the keys
> together. Primes, however, have way fewer divisors, therefore clustering
> is harder to achieve. Of course, you can construct a hash function so
> that it's utterly stupid (e.g. {return 4;}) and then all of our
> optimizations are thrown out of the window. But I believe that using
> prime numbers for table size is advised in the literature too.
> Apparently, wiki mentions this fact too:
> 
> https://en.wikipedia.org/wiki/Hash_table#Choosing_a_hash_function

Our hash function is quite complex. I doubt that you will be able to
achieve any measurable performance benefit by this change. Especially
due to the random factor we are using.

The hash function used in tests is dumb, but it's dumb on purpose. I
really doubt that messing with this is worth.
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by John Ferlan 7 years, 10 months ago

On 08/15/2017 08:01 AM, Peter Krempa wrote:
> On Mon, Aug 14, 2017 at 20:46:10 -0400, John Ferlan wrote:
>> On 08/03/2017 04:50 AM, Peter Krempa wrote:
> 
> [ trimmed off-topic part ]
> 
>> NB: I didn't dig into the qemumonitorjsontest algorithm, but...
>>
>> While going through the common object changes, I ended up looking at and
>> thinking about the hash table algorithms, initially it was just the
>> "grow" algorithm as I think it is a bit "aggressive" (perhaps wasteful)
>> since as soon as 1 bucket chain exceeds 8 elements, the table is resized
>> by 8 unless the new size is 8 * 2048 - at which time no matter how full
>> a bucket is we don't resize the table
> 
> Resizing the table is very expensive, so it makes sense to scale it
> aggresively. At some point though it would take too much memory.
> 
> If some place in the code expects massive hash table, it can set the
> hash table to > 2048 manually.
> 

Right - avoiding excessive or unnecessary resizing is something I was
considering. One thing that was at the back of my mind is "somewhat
recent" discussions about 10s of thousands of disk/volumes. IIRC it's
been discussed on qemu list as it relates to libguestfs.

Because volume names tend to be less random than a UUID, I was looking
at how the existing algorithms operate when you have say "disk00001"
through "disk99999" w/r/t bucket layout and length of chains when we
reach the maximum bucket count.

Still even with that, resizing a table just because one hash bucket
chain goes above 8 just seemed unnecessary. Still a work in progress
though - I may come up with nothing. It's the classic power struggle of
time vs. space.

>> The next thing I considered was using a prime number as the table bucket
>> size and it seems by just doing that will cause qemumonitorjsontest to
> 
> What would be the benefit of such change? I don't really see how prime
> numbers would improve anything performance-wise.
> 

Nothing performance wise directly that comes to mind. I see that while
composing Michal and Daniel have responded.

FWIW: My hacking has altered virhashtest.c to generate 1000's of UUIDs
and "elemN" names in order to check if/when one or the other has a
bucket full problem with the existing algorithms.  It's been a couple
weeks since I was looking at the data - so results are not fresh in mind
especially since I was on PTO last week.

>> fail. Easy enough to reproduce by changing the virHashCreate call in
>> qemuBlockNodeNameGetBackingChain to use 53 instead of 50 (even 49 causes
>> failure). It doesn't matter if the test source also adjusts the initial
>> hash size.
>>
>> Of course this makes sense given that virHashCodeGen is the backend of
>> the virHashComputeKey algorithm that uses the return value %
>> table->size. So if "size" changes, then obviously order changes.
> 
> As you've said that is expected. I don't quite follow what is your
> point here.
> 

I was pointing out that by changing the size you get different ordered
results, nothing more, nothing less. It just so happens that 50 returns
this set of results, but 53 is different.

>> While I get the code bloat conundrum, but if the order doesn't matter
>> then I wonder what other variables could change that would affect this
>> test and whether we really want to be constantly altering the mocks
>> "just" to get the right answer for this test.
> 
> The order depends on the result of the hasing function, the number of
> buckets and the order you've added the entries. All of those is
> expected when dealing with a hash table.
> 
> After this last patch, you are guaranteed to have the hash function
> return deterministic results (even on big endian-boxes).
> 

Change the size to 53, rebuild, and run the test. It'll take a few
minutes, but I think you'd see a failure.

> The hash size is not really altered very often since you can't really
> see a measurable performance benefit in most cases. This is also
> strenghtened by the fact, that the hash function is randomized by a
> random number, so in real usage you won't be able to deterministically
> cause hash collisions.

True... at 50 with 4 elements you won't see a resize. You could possibly
have a table size of 1 in this instance and be fine, but that only works
well for the test! It's more problematic for real world because a table
size of 1 grows to 8, 64, etc.  Some would say the minimum table size
should be at least 8 if not the next prime, e.g. 11.

In the long run, UUID's are random, but "#blockN" values are far less
random - it's the human randomness factor. No one is going to create 100
randomly named domain names or volume names, so factoring in the less
than randomly named elements is what I've been pondering.

> 
> In the test code, the cases are added in a deterministic order. Since
> the hashing function in the tests is deterministic as well, the only
> thing that would influence the ordering is the bucket count. We don't
> modify them too often.
> 
> I don't see what would make us alter mocks "just" to get the right
> answer here. The reason for this patch was, that the hash function is
> using bit operations and then treating the result as integers, which
> obviously does not work deterministically between big and little endian
> arches. Other than that, the test is now fully deterministic.
> 
> Peter
> 

Understood - until someone changes the size of the table to be random or
less deterministic, then there's no problem.  But if that happens,
there's a problem.

John

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by Daniel P. Berrange 7 years, 10 months ago

On Tue, Aug 15, 2017 at 10:02:38AM -0400, John Ferlan wrote:
> 
> 
> On 08/15/2017 08:01 AM, Peter Krempa wrote:
> > On Mon, Aug 14, 2017 at 20:46:10 -0400, John Ferlan wrote:
> >> On 08/03/2017 04:50 AM, Peter Krempa wrote:
> > 
> > [ trimmed off-topic part ]
> > 
> >> NB: I didn't dig into the qemumonitorjsontest algorithm, but...
> >>
> >> While going through the common object changes, I ended up looking at and
> >> thinking about the hash table algorithms, initially it was just the
> >> "grow" algorithm as I think it is a bit "aggressive" (perhaps wasteful)
> >> since as soon as 1 bucket chain exceeds 8 elements, the table is resized
> >> by 8 unless the new size is 8 * 2048 - at which time no matter how full
> >> a bucket is we don't resize the table
> > 
> > Resizing the table is very expensive, so it makes sense to scale it
> > aggresively. At some point though it would take too much memory.
> > 
> > If some place in the code expects massive hash table, it can set the
> > hash table to > 2048 manually.
> > 
> 
> Right - avoiding excessive or unnecessary resizing is something I was
> considering. One thing that was at the back of my mind is "somewhat
> recent" discussions about 10s of thousands of disk/volumes. IIRC it's
> been discussed on qemu list as it relates to libguestfs.
> 
> Because volume names tend to be less random than a UUID, I was looking
> at how the existing algorithms operate when you have say "disk00001"
> through "disk99999" w/r/t bucket layout and length of chains when we
> reach the maximum bucket count.

For any sensible hash algorithm, such a "predictable" naming convention
is not a realk problem - indeed if it were a problem it would be considered
a security vulnerability these days.

While murmurhash is not a cryptographic hash, a single bit difference
in names should still produce a very different hashcode value. If we
switch to siphash though, which is a cryptographic hash, we will have
a strong guarantee of a completely different hash code for such names.

So again I don't think bucket size is vs primes is important here - the
choice of hash function more directly determines the collision resistance
we have.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by Peter Krempa 7 years, 10 months ago

On Tue, Aug 15, 2017 at 10:02:38 -0400, John Ferlan wrote:
> On 08/15/2017 08:01 AM, Peter Krempa wrote:
> > On Mon, Aug 14, 2017 at 20:46:10 -0400, John Ferlan wrote:
> >> On 08/03/2017 04:50 AM, Peter Krempa wrote:
> > 
> > [ trimmed off-topic part ]
> > 
> >> NB: I didn't dig into the qemumonitorjsontest algorithm, but...
> >>
> >> While going through the common object changes, I ended up looking at and
> >> thinking about the hash table algorithms, initially it was just the
> >> "grow" algorithm as I think it is a bit "aggressive" (perhaps wasteful)
> >> since as soon as 1 bucket chain exceeds 8 elements, the table is resized
> >> by 8 unless the new size is 8 * 2048 - at which time no matter how full
> >> a bucket is we don't resize the table
> > 
> > Resizing the table is very expensive, so it makes sense to scale it
> > aggresively. At some point though it would take too much memory.
> > 
> > If some place in the code expects massive hash table, it can set the
> > hash table to > 2048 manually.
> > 
> 
> Right - avoiding excessive or unnecessary resizing is something I was
> considering. One thing that was at the back of my mind is "somewhat
> recent" discussions about 10s of thousands of disk/volumes. IIRC it's
> been discussed on qemu list as it relates to libguestfs.

If we assume that our hash function has an even distribution of hashes
(which was not refuted yet) then for 10k entries you should have 5ish
entries per bucket. Even with 100k you are at 50 per bucket. That is
totally managable.

> Because volume names tend to be less random than a UUID, I was looking
> at how the existing algorithms operate when you have say "disk00001"
> through "disk99999" w/r/t bucket layout and length of chains when we
> reach the maximum bucket count.
> 
> Still even with that, resizing a table just because one hash bucket
> chain goes above 8 just seemed unnecessary. Still a work in progress
> though - I may come up with nothing. It's the classic power struggle of
> time vs. space.

(This was covered by dan's reply)

> >> The next thing I considered was using a prime number as the table bucket
> >> size and it seems by just doing that will cause qemumonitorjsontest to
> > 
> > What would be the benefit of such change? I don't really see how prime
> > numbers would improve anything performance-wise.
> > 
> 
> Nothing performance wise directly that comes to mind. I see that while
> composing Michal and Daniel have responded.
> 
> FWIW: My hacking has altered virhashtest.c to generate 1000's of UUIDs
> and "elemN" names in order to check if/when one or the other has a
> bucket full problem with the existing algorithms.  It's been a couple
> weeks since I was looking at the data - so results are not fresh in mind
> especially since I was on PTO last week.
> 
> >> fail. Easy enough to reproduce by changing the virHashCreate call in
> >> qemuBlockNodeNameGetBackingChain to use 53 instead of 50 (even 49 causes
> >> failure). It doesn't matter if the test source also adjusts the initial
> >> hash size.
> >>
> >> Of course this makes sense given that virHashCodeGen is the backend of
> >> the virHashComputeKey algorithm that uses the return value %
> >> table->size. So if "size" changes, then obviously order changes.
> > 
> > As you've said that is expected. I don't quite follow what is your
> > point here.
> > 
> 
> I was pointing out that by changing the size you get different ordered
> results, nothing more, nothing less. It just so happens that 50 returns
> this set of results, but 53 is different.
> 
> >> While I get the code bloat conundrum, but if the order doesn't matter
> >> then I wonder what other variables could change that would affect this
> >> test and whether we really want to be constantly altering the mocks
> >> "just" to get the right answer for this test.
> > 
> > The order depends on the result of the hasing function, the number of
> > buckets and the order you've added the entries. All of those is
> > expected when dealing with a hash table.
> > 
> > After this last patch, you are guaranteed to have the hash function
> > return deterministic results (even on big endian-boxes).
> > 
> 
> Change the size to 53, rebuild, and run the test. It'll take a few
> minutes, but I think you'd see a failure.

I fail to see the point of such excercise. It is expected that it will
happen. If we wanted to make the testsuite slightly more robust in this
regard, we could also mandate a fixed hash size for the mocked hash, but
I don't quite feel the need to do so.

> > The hash size is not really altered very often since you can't really
> > see a measurable performance benefit in most cases. This is also
> > strenghtened by the fact, that the hash function is randomized by a
> > random number, so in real usage you won't be able to deterministically
> > cause hash collisions.
> 
> True... at 50 with 4 elements you won't see a resize. You could possibly
> have a table size of 1 in this instance and be fine, but that only works
> well for the test! It's more problematic for real world because a table
> size of 1 grows to 8, 64, etc.  Some would say the minimum table size
> should be at least 8 if not the next prime, e.g. 11.

This is not a problem for real use cases. Real use cases don't care
about the ordering and thus it does not matter whether it's resized or
not. Hash table is a wrong data structure for lists.

In the tests it's a bit different. The goal of the testsuite should be
to allow to write simple tests. Otherwise it will not be easy to
persuade people to add tests at all. This means that sometimes it's
necessary to cut corners.

In case of the testsuite that was failing, the result was correct in any
ordering. The checker I wrote was dumb though ... really dumb. Just a
strcmp. I don't want to waste time to write tests to do a full N to N
checker. That does not make any sense.

If the code uses a hash table for convenience/performance reasons and
it's convenient to test the contents of the table in the test suite,
it's just best to format it to a string and compare it against a file.
Especially since you get "VIR_TEST_REGEGERATE_OUTPUT" if you happen to
modify the test. That would be rather compicated with a specific hash
table checker and make test-writing more painful.

> In the long run, UUID's are random, but "#blockN" values are far less
> random - it's the human randomness factor. No one is going to create 100
> randomly named domain names or volume names, so factoring in the less
> than randomly named elements is what I've been pondering.

This also was covered by dan's statement about the hashing function.

> > In the test code, the cases are added in a deterministic order. Since
> > the hashing function in the tests is deterministic as well, the only
> > thing that would influence the ordering is the bucket count. We don't
> > modify them too often.
> > 
> > I don't see what would make us alter mocks "just" to get the right
> > answer here. The reason for this patch was, that the hash function is
> > using bit operations and then treating the result as integers, which
> > obviously does not work deterministically between big and little endian
> > arches. Other than that, the test is now fully deterministic.
> > 
> > Peter
> > 
> 
> Understood - until someone changes the size of the table to be random or
> less deterministic, then there's no problem.  But if that happens,
> there's a problem.

Umm, why would anybody do that? It would not make any sense to allocate
it at random. That's plain crazy. For anything other than that, it would
be deterministic, and thus while the test output might change at the
point when it would be changed, it would not lose determinism in the
suite. The only randomization which makes sense is in the hashing
function itself and we do have that.

Also as it was pointed out. The bucket count choice does not really add
much in terms of distribution of elements to the buckets. Thanks to the
auto-resize algorithm it's even somewhat resistant to undersized hashes.

Adding any kind of magic on top of this just does not seem to be worth
at all.
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by John Ferlan 7 years, 10 months ago


On 08/15/2017 11:47 AM, Peter Krempa wrote:
> On Tue, Aug 15, 2017 at 10:02:38 -0400, John Ferlan wrote:
>> On 08/15/2017 08:01 AM, Peter Krempa wrote:
>>> On Mon, Aug 14, 2017 at 20:46:10 -0400, John Ferlan wrote:
>>>> On 08/03/2017 04:50 AM, Peter Krempa wrote:
>>>
>>> [ trimmed off-topic part ]
>>>
>>>> NB: I didn't dig into the qemumonitorjsontest algorithm, but...
>>>>
>>>> While going through the common object changes, I ended up looking at and
>>>> thinking about the hash table algorithms, initially it was just the
>>>> "grow" algorithm as I think it is a bit "aggressive" (perhaps wasteful)
>>>> since as soon as 1 bucket chain exceeds 8 elements, the table is resized
>>>> by 8 unless the new size is 8 * 2048 - at which time no matter how full
>>>> a bucket is we don't resize the table
>>>
>>> Resizing the table is very expensive, so it makes sense to scale it
>>> aggresively. At some point though it would take too much memory.
>>>
>>> If some place in the code expects massive hash table, it can set the
>>> hash table to > 2048 manually.
>>>
>>
>> Right - avoiding excessive or unnecessary resizing is something I was
>> considering. One thing that was at the back of my mind is "somewhat
>> recent" discussions about 10s of thousands of disk/volumes. IIRC it's
>> been discussed on qemu list as it relates to libguestfs.
> 
> If we assume that our hash function has an even distribution of hashes
> (which was not refuted yet) then for 10k entries you should have 5ish
> entries per bucket. Even with 100k you are at 50 per bucket. That is
> totally managable.
> 

I really hadn't done any analysis, I was in the wondering stage and
setting up the experiment. It's a part of the code I didn't know much
about, so it was also a learning exercise. I never knew anything about
murmurhash nor siphash before today!

>> Because volume names tend to be less random than a UUID, I was looking
>> at how the existing algorithms operate when you have say "disk00001"
>> through "disk99999" w/r/t bucket layout and length of chains when we
>> reach the maximum bucket count.
>>
>> Still even with that, resizing a table just because one hash bucket
>> chain goes above 8 just seemed unnecessary. Still a work in progress
>> though - I may come up with nothing. It's the classic power struggle of
>> time vs. space.
> 
> (This was covered by dan's reply)
> 
>>>> The next thing I considered was using a prime number as the table bucket
>>>> size and it seems by just doing that will cause qemumonitorjsontest to
>>>
>>> What would be the benefit of such change? I don't really see how prime
>>> numbers would improve anything performance-wise.
>>>
>>
>> Nothing performance wise directly that comes to mind. I see that while
>> composing Michal and Daniel have responded.
>>
>> FWIW: My hacking has altered virhashtest.c to generate 1000's of UUIDs
>> and "elemN" names in order to check if/when one or the other has a
>> bucket full problem with the existing algorithms.  It's been a couple
>> weeks since I was looking at the data - so results are not fresh in mind
>> especially since I was on PTO last week.
>>
>>>> fail. Easy enough to reproduce by changing the virHashCreate call in
>>>> qemuBlockNodeNameGetBackingChain to use 53 instead of 50 (even 49 causes
>>>> failure). It doesn't matter if the test source also adjusts the initial
>>>> hash size.
>>>>
>>>> Of course this makes sense given that virHashCodeGen is the backend of
>>>> the virHashComputeKey algorithm that uses the return value %
>>>> table->size. So if "size" changes, then obviously order changes.
>>>
>>> As you've said that is expected. I don't quite follow what is your
>>> point here.
>>>
>>
>> I was pointing out that by changing the size you get different ordered
>> results, nothing more, nothing less. It just so happens that 50 returns
>> this set of results, but 53 is different.
>>
>>>> While I get the code bloat conundrum, but if the order doesn't matter
>>>> then I wonder what other variables could change that would affect this
>>>> test and whether we really want to be constantly altering the mocks
>>>> "just" to get the right answer for this test.
>>>
>>> The order depends on the result of the hasing function, the number of
>>> buckets and the order you've added the entries. All of those is
>>> expected when dealing with a hash table.
>>>
>>> After this last patch, you are guaranteed to have the hash function
>>> return deterministic results (even on big endian-boxes).
>>>
>>
>> Change the size to 53, rebuild, and run the test. It'll take a few
>> minutes, but I think you'd see a failure.
> 
> I fail to see the point of such excercise. It is expected that it will
> happen. If we wanted to make the testsuite slightly more robust in this
> regard, we could also mandate a fixed hash size for the mocked hash, but
> I don't quite feel the need to do so.
> 

Got it. You asked a question, I answered.  The only reason I used this
particular patch to respond was because this particular test had an
error result in my example and there was some concern over the
deterministic results.

>>> The hash size is not really altered very often since you can't really
>>> see a measurable performance benefit in most cases. This is also
>>> strenghtened by the fact, that the hash function is randomized by a
>>> random number, so in real usage you won't be able to deterministically
>>> cause hash collisions.
>>
>> True... at 50 with 4 elements you won't see a resize. You could possibly
>> have a table size of 1 in this instance and be fine, but that only works
>> well for the test! It's more problematic for real world because a table
>> size of 1 grows to 8, 64, etc.  Some would say the minimum table size
>> should be at least 8 if not the next prime, e.g. 11.
> 
> This is not a problem for real use cases. Real use cases don't care
> about the ordering and thus it does not matter whether it's resized or
> not. Hash table is a wrong data structure for lists.
> 
> In the tests it's a bit different. The goal of the testsuite should be
> to allow to write simple tests. Otherwise it will not be easy to
> persuade people to add tests at all. This means that sometimes it's
> necessary to cut corners.
> 
> In case of the testsuite that was failing, the result was correct in any
> ordering. The checker I wrote was dumb though ... really dumb. Just a
> strcmp. I don't want to waste time to write tests to do a full N to N
> checker. That does not make any sense.
> 
> If the code uses a hash table for convenience/performance reasons and
> it's convenient to test the contents of the table in the test suite,
> it's just best to format it to a string and compare it against a file.
> Especially since you get "VIR_TEST_REGEGERATE_OUTPUT" if you happen to
> modify the test. That would be rather compicated with a specific hash
> table checker and make test-writing more painful.
> 
>> In the long run, UUID's are random, but "#blockN" values are far less
>> random - it's the human randomness factor. No one is going to create 100
>> randomly named domain names or volume names, so factoring in the less
>> than randomly named elements is what I've been pondering.
> 
> This also was covered by dan's statement about the hashing function.
> 

The whole prime number thing is just one of those I recall using a prime
number for the divisor in a hashing algorithm is considered better - in
this case I hadn't considered the CodeGen half of the equation. It
wasn't yet on the radar.

>>> In the test code, the cases are added in a deterministic order. Since
>>> the hashing function in the tests is deterministic as well, the only
>>> thing that would influence the ordering is the bucket count. We don't
>>> modify them too often.
>>>
>>> I don't see what would make us alter mocks "just" to get the right
>>> answer here. The reason for this patch was, that the hash function is
>>> using bit operations and then treating the result as integers, which
>>> obviously does not work deterministically between big and little endian
>>> arches. Other than that, the test is now fully deterministic.
>>>
>>> Peter
>>>
>>
>> Understood - until someone changes the size of the table to be random or
>> less deterministic, then there's no problem.  But if that happens,
>> there's a problem.
> 
> Umm, why would anybody do that? It would not make any sense to allocate
> it at random. That's plain crazy. For anything other than that, it would
> be deterministic, and thus while the test output might change at the
> point when it would be changed, it would not lose determinism in the
> suite. The only randomization which makes sense is in the hashing
> function itself and we do have that.
> 

Poorly chosen "random" word by me... Perhaps unexpected or larger than
supplied value. We round up in various places for various reasons -
doing the same with a supplied hash table size wouldn't be out of the
question especially since we can "grow" to some larger value. In this
case, I just happened to alter the size to next prime for both create
and grow and fell down into this rabbit hole.

> Also as it was pointed out. The bucket count choice does not really add
> much in terms of distribution of elements to the buckets. Thanks to the
> auto-resize algorithm it's even somewhat resistant to undersized hashes.
> 
> Adding any kind of magic on top of this just does not seem to be worth
> at all.
> 

Fair enough.  I certainly don't want to spend my days characterizing the
performance of various algorithms to alter the size of a hash table that
has a hash algorithm that is distributing the elements rather evenly. I
still ask myself is having 1 chain longer than 8 too few to cause a
grow? And at what inflection point should a grow be done.

John

FWIW: Since it was simple enough, here's some mostly raw data adding 50K
elems to a hash table that starts out with a size of 1.

For every 1000 nbElems, I figured the number of empty buckets, the
number of "full" buckets (e.g. more than 8 elems), and the longest chain
as follows:

When the name is "#blockN":
nbElems=6000 size=4096, empty=955 full=1 longest=9    <== first "full"
nbElems=10000 size=4096, empty=341 full=5 longest=12
nbElems=15000 size=4096, empty=96 full=43 longest=14      inflection
nbElems=16000 size=4096, empty=77 full=68 longest=14  <== between full
nbElems=17000 size=4096, empty=61 full=91 longest=14      and empty
nbElems=18000 size=4096, empty=43 full=131 longest=14
nbElems=19000 size=4096, empty=36 full=181 longest=14
nbElems=20000 size=4096, empty=32 full=239 longest=15
nbElems=25000 size=4096, empty=9 full=637 longest=16
nbElems=30000 size=4096, empty=2 full=1287 longest=18
nbElems=40000 size=4096, empty=1 full=2649 longest=23
nbElems=50000 size=4096, empty=0 full=3508 longest=27

FWIW: Using an algorithm to have a prime "size" value divisor allows a
higher maximum size of 5779 buckets and the first "full" shows up
between 13K and 14K... Somewhere around 23K is the inflection between
empty and full... At 50K elems we get empty=1 full=2873 longest=22

For a randomly generated UUID:
nbElems=7000 size=4096, empty=736 full=1 longest=9    <== first "full"
nbElems=10000 size=4096, empty=360 full=5 longest=10
nbElems=15000 size=4096, empty=121 full=50 longest=13
nbElems=16000 size=4096, empty=99 full=80 longest=13      inflection
nbElems=17000 size=4096, empty=84 full=105 longest=13 <== between full
nbElems=18000 size=4096, empty=69 full=137 longest=14     and empty
nbElems=19000 size=4096, empty=55 full=177 longest=14
nbElems=20000 size=4096, empty=49 full=243 longest=14
nbElems=25000 size=4096, empty=13 full=649 longest=16
nbElems=30000 size=4096, empty=3 full=1276 longest=18
nbElems=40000 size=4096, empty=1 full=2634 longest=21
nbElems=50000 size=4096, empty=0 full=3513 longest=26

FWIW: Using the prime size algorithm... Between 9K and 10K we get our
first full, somewhere around 22K to 23K we get the inflection between
empty and full, and at 50K elems we get empty=2 full=2903 longest=22.

So the prime number does effect things a bit, but it doesn't really seem
to be significant.

Next, I started to tinker with the thought that perhaps not increasing
the size as soon as we get 1 bucket longer than 8 elems, but rather
trying to only grow when say 10% of the buckets were full or if any one
bucket had more than 16 elems and not have a cap to size as well as
perhaps growing the buckets more slowly once you reach certain nbElems.

For the #blockN's:
nbElems=10000 size=2459, empty=46 full=62 longest=14
nbElems=15000 size=3079, empty=19 full=164 longest=15
nbElems=20000 size=6029, empty=209 full=43 longest=12
nbElems=25000 size=7537, empty=273 full=52 longest=14
nbElems=30000 size=7537, empty=130 full=156 longest=15
nbElems=40000 size=9421, empty=140 full=282 longest=15
nbElems=50000 size=11777, empty=151 full=363 longest=15

@19507 elems is when we grow from 4813 to 6029 buckets
@21288 elems is when we grow from 6029 to 7537 buckets
@30203 elems is when we grow from 7537 to 9421 buckets
@46589 elems is when we grow from 9421 to 11777 buckets

For the UUID's:
nbElems=10000 size=2459, empty=38 full=65 longest=13
nbElems=15000 size=3079, empty=24 full=172 longest=15
nbElems=20000 size=4813, empty=80 full=133 longest=16
nbElems=25000 size=4813, empty=27 full=412 longest=16
nbElems=30000 size=6029, empty=47 full=426 longest=15
nbElems=40000 size=9421, empty=120 full=292 longest=15
nbElems=50000 size=11777, empty=174 full=380 longest=15

@25078 elems is when we grow from 4813 to 6029 buckets
@30409 elems is when we grow from 6029 to 7537 buckets
@35473 elems is when we grow from 7537 to 9421 buckets
@41512 elems is when we grow from 9421 to 11777 buckets

FWIW: For this round I also tracked the number of items in various
buckets when a resize occurred... It was a fairly standard deviation
with the largest number of buckets having 4-7 elements in their chains.
Furthermore the number of buckets with more than 10 elements was minimal.

That made me wonder if we shouldn't grow unless there was more than
perhaps 16 elems in a bucket... Taking that approach grows the table
beyond 4096 buckets only when at greater than 26K elements. Space vs.
time... We're not even close to needing 10's of thousands yet, but it
was to a degree an interesting exercise.

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

Posted by Peter Krempa 7 years, 10 months ago

On Tue, Aug 15, 2017 at 16:07:16 -0400, John Ferlan wrote:
> On 08/15/2017 11:47 AM, Peter Krempa wrote:
> > On Tue, Aug 15, 2017 at 10:02:38 -0400, John Ferlan wrote:
> >> On 08/15/2017 08:01 AM, Peter Krempa wrote:
> >>> On Mon, Aug 14, 2017 at 20:46:10 -0400, John Ferlan wrote:
> >>>> On 08/03/2017 04:50 AM, Peter Krempa wrote:

[...]

> Fair enough.  I certainly don't want to spend my days characterizing the
> performance of various algorithms to alter the size of a hash table that
> has a hash algorithm that is distributing the elements rather evenly. I
> still ask myself is having 1 chain longer than 8 too few to cause a
> grow? And at what inflection point should a grow be done.

[...]

> That made me wonder if we shouldn't grow unless there was more than
> perhaps 16 elems in a bucket... Taking that approach grows the table

If you are going to try to optimize this, you certainly will need to
characterize the function and also determine the split betwen additions
and lookups and the probabilty of a grow which is computationally
intensive.

> beyond 4096 buckets only when at greater than 26K elements. Space vs.
> time... We're not even close to needing 10's of thousands yet, but it
> was to a degree an interesting exercise.

For most of our uses a linear list would be more than enough. The hash
table has a convenient API though if you need to access elements by text
keys. I don't think libvirt will ever reach the point when the hash
table would be a problem for an operation.
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

[libvirt] [PATCH 1/3] util: hash: Include stdbool.h in the header file
[libvirt] [PATCH 2/3] util: hash: Make virHashCodeGen mockable
[libvirt] [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent