[Date Prev][Date Next][Thread Prev][Thread Next][Author Index][Date Index][Thread Index]

Hashes and Y2K-thinking


Some time ago I had a discussion with A-J about mediaserver ids, and
that using *only* a content hash as the id would be year 2000 thinking.
I've thought about that, and have come to the conclusion that that
doesn't make much sense to me.

First, consider the following scenario: There are actually two blocks
with the same hash, but with different bytes, and they are distributed
on the network. There are even computers which store both of these
blocks, or are at least aware of both of them (which is necessary for
this to be a problem). Our nice system with the timestamp ensures that
they will have different IDs. Great, no clashes. Now, one of those
computers that is at least aware of both blocks is the computer of S.
Spoofer. S. checks all the block they're aware of for hash clashes, and
finds that those two blocks have the same hash. Being a spoofer, S.
distributes the first block under the second block's id, and the second
block under the first block's id: the spoofing cannot be detected, since
the hashes are the same.

But how probable is this? That depends on the entropy of the hashes; and
frankly, I have no idea about what it is. But it's gotta be pretty high,
even if the entropy of the blocks themselves is low, because hashes are
supposed to be very different even if the hashed things are very
similar. Let me take a wild guess and assume an entropy of 15 bytes in
our 20 byte hash (tell me if that number is so much too big that my
argument doesn't make any sense). That makes 120 bits, i.e. the
probability of an arbitrary block having the same hash as a given block
is 2 ** 120.

Now, let's take a chance of one in a billion that there will be a hash
clash. AFAIR, that's smaller a chance than the ones taken in nuclear
power plant engeneering. 2 ** 30 is more than a billion.

That means that we can tolerate a total of 2 ** 90 blocks in the system,
taking the above chance. That's more than 10 ** 27, which is a hell of a
lot. I think before we come close to that number, we have enough time to
re-think our approach. ;o) Plus, if we add another hash algorithm as we
talked about in Aarhus, the chances get still somewhat smaller.

Finally, there is also another reason why we might want to have
something else but hashes in the ids: we might want to explicitly
distinguish between different (especially small) blocks with the same
bytes. If I type in only "wl" in one session and gzz stores it into one
block, it should have a different span identity from a block containing
only "wl" that some other person entered. However, as long as we store
the metadata in the hashed block, that is not a real problem.

So I propose that the next generation of ids be simpler and only contain
one or more hashes.