Click here to Skip to main content
15,846,342 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
I have a set of nodes (N) arranged into one or more networks. Each network has one or more root nodes. Each node has a unique ID (UID) that is a 32-bit integer. If there were 100,000 nodes that would be a large problem - 1,000,000 would be almost infeasibly gigantic. I need to check if all nodes are connected to at least one root-node, and I'm trying to design a hash function to help me.

My current code starts at each root node and traces all paths, recording each connected node. It does this without hashing, since previously I could rely on the UIDs being relative small and mostly contiguous. Along with a starting offset, it just uses the UID directly to index into a bitmap array. The bitmap is the compared against the members of N to determine which nodes are not connected.

Now, however, I'm coming across situations where there are large gaps in the UIDS (e.g., jumping from 100,000 to 100,000,000) - this make simply increasing the size of the bitmap more and more impractical, hence the need for a hash. In the general case, provided the user hasn't completely messed up, I would expect the number of unconnected nodes to be less than the number of connected ones (but not necessarily).

How do I, for instance, pick how many buckets I should use for hashing to balance memory use against likelihood of collisions? Can this (should this?) be done dynamically? (i.e., should I dynamically allocate my bucket array based on the size of N?) Is it more efficient to start by hashing all my node UIDs into buckets, then clearing the 'connected' hashes as they are discovered? How can I can I determine the best way to handle collisions? Is it possible to create a perfect hash dynamically?

What I have tried:

Nothing much more than Googling. I'm not a computer scientist, so this is a bit new to me. I'm finding the masses of online information a bit hard to digest - hence the appeal to the experts.
Posted
Updated 30-Oct-19 4:57am

If you need to check for the existence or lack there of something in a list and you don't have the memory capacity or it is infeasible to store use Bloom filter - Wikipedia[^] .

Bloom filters will give you a definite negative if it is not encoded in the bloom bits, and a probable positive for it's existence (the probability is tune able).

[hash functions by definition will have collision and there is no "perfect" hash function only statistically even hash functions i.e. no clumping of values]
 
Share this answer
 
Comments
Kyudos 21-Feb-19 18:14pm    
Thanks for this. I looked into it and it would seem to do the trick. But it is also quite complex. So I might just hit my problem with a slow but reliable hammer in the odd cases where my bitmap doesn't work.
Mehdi Gholam 22-Feb-19 1:30am    
How Bloom works is complex, but using it is as simple as a hash function. try this https://gist.github.com/richardkundl/8300092
maintain a list of bitmaps rather than one big bitmap- with each bitmap capable of storing 100,000 bits (ie 12,500 bytes) - have each bitmap in some structure (class etc) which also records the id of the first bit in that bitmap - that will allow for sparse bitmaps pretty efficiently
 
Share this answer
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900