[Tfug] Communications reliability

Bexley Hall bexley401 at yahoo.com
Mon Dec 31 10:25:38 MST 2007


Hi,

I'm working on a distributed application.  I
have done this sort of thing before, *but*,
always in an environment where the interconnect
wiring was considered a *crucial* element of
the system.  I.e., playing a role similar to
the data bus between a CPU and it's memory
(in that you *assume* the integrity of that).
If a cable was cut, unplugged or an interface
died, it was *proper* for the system to "stop
working" (though in an orderly, controlled
fashion).

Now, however, I am dealing with a scenario in 
which nodes may become disconnected accidentally
or *intentionally* and the "system" must cope
with their lack of availability.  I.e., the
features/facilities/capabilities that they
represent/implement are no longer accessible
but the application itself must still continue
to run.

In a very loose sense, this is similar to how
The Internet works:  if a site is "down", then
you (e.g., a web browser) just can't access the
assets of that site -- but the rest of The
Internet is still accessible.

However (building on that web browser example),
the web browser <-> user interface is really
quite crude in that instance:  "site unavailable".
Worse yet, this hides a plethora of problems that
might be the cause (i.e., the site might be "up"
but a router between here and there might be 
having difficulties).

Note that browsers (and the underlying network
clients on which they rely) *tend* to silently
make several repeated attempts to achieve their
goal, usually relying on something as crude as
a time-out to decide when to "fail".
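
A rough sketch of that crude strategy (Python;
the retry counts and timeouts here are arbitrary
placeholders):

    import socket
    import time

    def fetch_with_retries(host, port, attempts=3,
                           timeout=2.0, backoff=2.0):
        """Open a TCP connection, retrying with
        exponential backoff; when every attempt has
        failed, re-raise the last error -- i.e.,
        "time out, then give up"."""
        delay = 1.0
        last_err = None
        for _ in range(attempts):
            try:
                return socket.create_connection(
                    (host, port), timeout=timeout)
            except OSError as err:  # timeout, refused, unreachable...
                last_err = err
                time.sleep(delay)   # back off before the next try
                delay *= backoff
        raise last_err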

I'm looking for suggestions as to how a "node"
(client *or* server) might more intelligently
do this.

E.g., when *I* encounter a problem with a
web site, I fall back on other tools to try
to determine if there is a *real* problem
(connectivity, etc.) or if this is perhaps
just a temporary overload of some resource
(e.g., network bandwidth, the site's available
computing power, etc.).  This tells me:
- if there is a "real" problem (vs. having a
  time-out that is presently "too quick")
- where that problem might be
- how likely I am to be able to complete my
  request "if I am persistent"

Note that, to some extent, we all do this.
Some folks might hammer away at a site (resource)
until they "get through".  Others might quickly
shift their attention to an alternate site that
might suffice (e.g., try the next "result" in
a search if the present one is "not answering").
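
That second policy, at least, is easy to
mechanize.  A sketch (Python; `hosts` is assumed
to be a list of equivalent resources):

    import socket

    def first_responder(hosts, port, timeout=2.0):
        """The "try the next result" policy: rather
        than hammer one unresponsive resource, walk
        a list of equivalent ones and settle on the
        first that answers."""
        for host in hosts:
            try:
                with socket.create_connection(
                        (host, port), timeout=timeout):
                    return host     # this one is answering
            except OSError:
                continue            # next candidate
        return None                 # nobody home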

Deciding *which* tactic to apply (and when to
give up) gets a bit trickier when a machine has
to make these decisions.  :<  Hence, my search for
sensible algorithms to encapsulate this behavior.

E.g., a simple ping can tell you that connectivity
exists and the target *appears* to be "up".  So,
if some other service is not answering, it is
likely that the problem lies in that service...
I can be a bit smarter than this, of course (e.g.,
look at what other traffic is running on that I/F
and build a tiny expert system to tell me whether
the fault lies in the I/F, the network, the
targeted node, etc.).
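
A minimal version of such a ladder, assuming a
Linux-style ping(8) (the -c/-W flags) alongside
a TCP probe of the service itself:

    import socket
    import subprocess

    def locate_fault(host, port, timeout=2):
        """Tiny decision ladder: is the fault in the
        service, or in the host/path?  A TCP connect
        probes the service; the system ping probes
        raw reachability."""
        try:
            with socket.create_connection(
                    (host, port), timeout=timeout):
                return "all-good"
        except OSError:
            pass                    # service mute; dig deeper
        pingable = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL).returncode == 0
        if pingable:
            return "service-fault"  # host answers, service doesn't
        return "host-or-path-fault" # can't even reach the host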

Note that I can access lots of things (data)
that aren't typically checked in the above example
(e.g., I can verify that carrier is present on
an interface to ensure that *my* cable isn't
unplugged, etc.).
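
On Linux, for instance, the kernel exports that
carrier bit through sysfs.  A sketch ("eth0" is
just a placeholder interface name):

    def carrier_present(iface="eth0"):
        """Is *our* cable even plugged in?  Reads the
        kernel's carrier flag from sysfs; returns
        False if the interface is down or missing."""
        try:
            with open(f"/sys/class/net/{iface}/carrier") as f:
                return f.read().strip() == "1"
        except OSError:
            return False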

Any "more elegant" (or, "more practical") suggestions?

Thx,
--don

