[Tfug] OT? "Hamming distance" in non-binary codes?

Bexley Hall bexley401 at yahoo.com
Wed Feb 5 23:29:09 MST 2014


Hi Robert,

On 2/5/2014 10:07 PM, Robert Hunter wrote:
> On Wed, Feb 5, 2014 at 5:48 PM, Bexley Hall<bexley401 at yahoo.com>  wrote:
>
>> I need to come up with a (large) set of identifiers.
>> Ideally, numeric (decimal).  And, as "short" as
>> *practical*.
>
> Human readable?

Yes.  Something that folks would already be accustomed to
"jotting down".  And, as a result, very likely to suffer
from the sort of "penmanship" issues that seem to make it
impossible for person A to RELIABLY read something written
by person B.

> Why decimal?  Why not alphanumeric, for instance?

I was hoping for something that would also be suitable for
entry via telephone (i.e., touch tone).

>> In effect, you want to increase the Hamming distance between
>> "valid" identifiers so these sorts of transcription errors
>> are minimized and/or easily recognized ("Um, '33333' is not
>> a valid identifier -- are you sure you don't mean '38838'?").
>
> Hamming distance is the minimum number of substitutions it takes to mutate
> one string to another.

Yes.  As typically applied to binary coded values, you're dealing
with the number of "bit flips" that are allowed.  If you treat each
bit as having an equal, independent probability of being flipped,
then you can determine the probability that "too many" flips have
occurred for you to reliably detect that *any* have occurred!

(e.g., with simple parity, any even number of bit flips yields
a "valid" -- i.e., not regarded as having been corrupted -- code!)

> If you're concerned with transcription errors that
> are due to misreads (rather than some random process),

Correct.  Once the data is "digitized", I can take other measures
to preserve/protect it.  The problem happens when a person is
reading an identifier off of one medium and entering it into another
(e.g., reading a *typewritten* identifier -- in some typeface that
has not been chosen with particular emphasis on the differentiability
of numeric glyphs -- and typing it into a computer;  worse yet,
reading an identifier off of a hand-written note -- which may have
been transcribed from some *other* medium! -- and entering it).

People seem prone to transposing digits.  Or, seeing something that
isn't really there (that's a fleck of dirt on the upper arm of that
'6' that is making it look like a second loop -- for an '8').  And,
these sorts of errors persist even when they "double-check" what they
wrote (because their memory of what they wrote reinforces what they
*think* they should have written!)

> then you should be
> able to state rules that will help avoid this, for example: no two
> identifiers, M and N, exist such that the i-th position of M is '3', and
> that of N is '8'.

Yes.  Codify the types of transcription errors/misreads that are
*likely* to occur and then ensure you never generate an identifier
that differs from any previously issued identifier in *just* one
position, where the difference is attributable to one of these
"likely" errors.
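
Something like this (rough Python sketch; the CONFUSABLE table here
is strictly a placeholder -- building the *real* table is the hard
part, more on that below):

    # Placeholder pairs of digits "likely" to be misread for each other.
    # The real table has to come from observed misread/miswrite data.
    CONFUSABLE = {("3", "8"), ("8", "3"), ("4", "9"), ("9", "4"),
                  ("1", "7"), ("7", "1"), ("0", "6"), ("6", "0")}

    def one_misread_away(a, b):
        """True if a and b differ in exactly one position, and that
        difference is one of the "likely" misreads."""
        if len(a) != len(b):
            return False
        diffs = [(x, y) for x, y in zip(a, b) if x != y]
        return len(diffs) == 1 and diffs[0] in CONFUSABLE

    def try_issue(candidate, issued):
        """Accept a candidate identifier only if no already-issued
        identifier is a single likely-misread away from it."""
        if candidate in issued:
            return False
        if any(one_misread_away(candidate, other) for other in issued):
            return False
        issued.add(candidate)
        return True

    issued = set()
    print(try_issue("38838", issued))   # True
    print(try_issue("33838", issued))   # False -- one '3'/'8' misread from "38838"
    print(try_issue("21605", issued))   # True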

But, note that 4 and 9 are rarely "confused" when typewritten
(at least in most of the non-decorative typefaces that I have
encountered and that are, thus, likely to see this application).
Yet, it is not uncommon for many people to *write* 4's that
resemble 9's.  Or, 7's that resemble 1's.  Or...

(I.e., the "likely errors" are a lot harder to codify than you
would expect if just thinking in terms of typefaces).

Other issues (e.g., identifier length, "format", etc.) also play
a role in how reliably a particular identifier can be "recalled"
or "transcribed".

>> Additionally, this helps provide some protection against
>> folks "guessing" valid identifiers.  Credit card issuers
>> (the phreaking topic reminded me of this) exploit this to
>> minimize the chance of a CC account number being "mis-read"
>> (or, mis-heard!) orally, etc.
>
> The checksum used in credit card numbers is there to protect against
> accidental errors, and is not intended for cryptographic use.  See

The original "calling card code" (phreaking) was actually intended
to make it "impossible" for a person to synthesize a valid "calling
card" by putting lots of holes in the "identifier space".

[calling cards were like credit cards for making phone calls -- back
before the telecom "break-up".  A popular way for college kids with
little disposable income to make expensive long distance calls on
someone else's dime  :> ]

> http://en.wikipedia.org/wiki/Luhn_algorithm.  You could certainly layer
> this on top of some robust transcription scheme.
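
For reference, a quick Python sketch of the Luhn check (the test
value is the example number from that Wikipedia page):

    def luhn_ok(number):
        """From the right, double every second digit (subtracting 9
        from any result over 9); the grand total, check digit
        included, must be a multiple of 10."""
        total = 0
        for i, c in enumerate(reversed(number)):
            d = int(c)
            if i % 2 == 1:        # every second digit from the right
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return total % 10 == 0

    print(luhn_ok("79927398713"))   # True
    print(luhn_ok("79927398716"))   # False -- corrupted check digit

Note that Luhn catches any single-digit error and most adjacent
transpositions (09 <-> 90 being the blind spot) -- but, as you say,
it's strictly accidental-error protection.
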
>
>> Any suggestions as to how to approach this deterministically?
>> Ruling out digits (symbols) that can be misread/miswritten takes
>> a huge chunk out of the identifier space (i.e., each "digit place"
>> can only represent ~5 different symbols instead of *10*).  OTOH,
>> I think smarter coding schemes can provide me with something more
>> than this.  E.g., disallow certain combinations of "susceptible
>> digits" but still allow their use (trivial example: 83 and 38 are
>> valid but 33 and 88 aren't)
>
> This seems likely to be the subject of existing research.  I would be
> interested if you dig anything up.

<frown>  I haven't stumbled on anything worthwhile, yet.  I would have
thought someone like MS would have dealt with this "scientifically"
with all their 25-30 character "product keys" (Is that a B or an 8?
Q or an O?  An F or a P?  etc. -- wonderful when you get older and
can't read the damn things to begin with!  :< )

[I had to read some product keys that had been "damaged" off CoA's
recently and it was virtually impossible to figure out what some
of the glyphs were due to these sorts of ambiguities... not enough
"physical distance" between the valid glyphs!]


