[Tfug] OT: E-Book position representation

Bexley Hall bexley401 at yahoo.com
Sat Dec 20 10:57:56 MST 2008


Hi, John,

> >Short answer:  use UTF-16 instead of UTF-8 at 
> >*some* level in your data representation!  :>
> 
> I should have been clearer in the first email.  
> My program = my format.  :)  (With some exceptions).
> The current incarnation reads zTXT documents
> (Weasel's native format) and PalmDOC (AKA 
> AportisDoc). Due mostly to Palm limitations,
> both formats are one byte = one character, 
> though you can alter the codepage used for the
> character set.

OK, understood.  I.e., the only error/drift that 
enters into your current implementation (re: 
position_% = byte_offset / length) is any of 
those "set codepage" byte sequences that contribute
to 'length' without representing 'characters'.
The UTF-8 approach just takes this to a whole 
new level of cruft  :-/
 
> I had forgotten about UTF-16.  If I simply decree
> that this is to be the format, might that solve 
> many of my problems in one fell swoop?

Sorry, my comment was intended to highlight the folly
of Unicode wrt this sort of issue.  :<  As with 
everything, it depends on how pedantic you want 
to be -- where you start trading off "correctness"
vs. "sanity"  :-/

E.g., Unicode (UTF-8, UTF-16, UTF-7 -- each is 
just a different encoding for the same "data")
supports the notion of "combining characters".
Here, a "base character" is adorned with one
(or more!) diacritical marks to form the actual
glyph/grapheme that the human user thinks of as
a "letter" (character).  So, two or more "Unicode
characters" (grrrr... I need a better taxonomy to
explain this; the Standard calls them "code points")
can be used to represent what a human would
consider a single "character".

Indeed, those *sequences* could sometimes even 
be represented by a *single* Unicode character
(another sign of Unicode's insanity).  E.g., 
the human user would perceive \u00E3 and
\u0061 \u0303 identically!  :<  Yet the latter 
takes two code points where the former takes one
(four bytes vs. two in UTF-16)!
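
[A rough sketch of the \u00E3 vs. \u0061 \u0303 point, in C.  The
function names are made up for illustration, and the test for a
"combining mark" only covers the Combining Diacritical Marks block
(U+0300..U+036F).  The real rules for what a reader perceives as one
character are far more involved, so treat this as a toy.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True for code units in the Combining Diacritical Marks block
   (U+0300..U+036F).  Real text has many more combining ranges. */
int is_combining_mark(uint16_t u)
{
    return u >= 0x0300 && u <= 0x036F;
}

/* Count what a reader would call "letters" in n UTF-16 code units:
   every unit that is not a combining mark starts a new one. */
size_t perceived_chars(const uint16_t *s, size_t n)
{
    assert(s != NULL || n == 0);

    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (!is_combining_mark(s[i]))
            count++;
    }
    return count;
}
```

Fed the one-unit form { 0x00E3 } or the two-unit form
{ 0x0061, 0x0303 }, this reports one perceived character either way,
even though the second occupies twice the storage.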

If you enjoy digging through trivialities, grab
a copy of _The Unicode Standard_ and *try* to 
make sense of it  :-(  Aside from all the 
"pretty pictures" (the print edition has glyphs
for all of the assigned character space!), it
gives a good feel for just how bizarre this 
application domain is!  (this might be available
in PDF form on their website)

OTOH, the "Implementation Guidelines" chapter is
probably a worthwhile read.  It's only about 
40 pp and tries to touch on various topics that
most "users" (of the Standard) will have to face
(e.g., sorting, searching, etc.)

> It's been a while since I read up on Unicode, 
> but, IIRC, UTF-16 has a constant 16bit character
> size, yes?

<grin>  Well, ignoring my above comments, "yes".
(hence the source of the original humor)

> Unlike UTF-8's variable size.

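UTF-16 gets you past the 1/2/3/4 byte encoding of
UTF-8 -- at least for everything in the Basic
Multilingual Plane (code points above U+FFFF need a
"surrogate pair", i.e., four bytes).  In that sense,
it's a big win -- at the cost of *forcing* you to use
at least two bytes for every character.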

The point of my humor was that people will 
(visually) regard "characters" (graphemes??)
differently than even the "simplicity" of the
UTF-16 encoding as witnessed by the \u00E3 
example I mentioned.

[this, IMO, is important.  *I* think it makes very
apparent the futility of trying to come up with a
*pedantic* implementation.  It's just way too much
work for the improvement that it tries to provide]

[remember this despite my comments/arguments that
follow!]
 
> Hmmm.. would this be
> enough?  I mean, I don't need to support *every*
> language humanity has.  :)

Make sure you support the full Braille character 
set!!  (again, humor.  Let me tell you about my 
Braille LED clock sometime... think about it :> )

> That would get rid of internal location 
> representation issues, efficiency of location
> finding and searching, and probably lots of 
> other things too.

Note that you still won't be "100% perfect".  
But, IMO, you will have taken a big enough "hit"
(i.e., bearing the cost of two-byte representations
for each character whereas most "Western" languages
could probably be represented in half that using a
one-byte Latin encoding) to show that "you care" --
without having to kill yourself trying to eke out
every little detail.

[It's worth examining The Standard, though, to 
make sure none of these pesky details trip you up
in your parsing of input, etc.  For example, if you
fall into the middle of a "composed character"
sequence, making sure your parser doesn't get out
of sync and generate a superfluous/nonexistent
character (reading ahead in your reply...)  You
might be able to offload all/most of this 
responsibility to a desktop library at conversion
time ]

> ><frown>  I wonder at the utility of reporting 
> >(and accepting as input!?) fractional percentages.
> >As documents get larger, those fractions lose
> >precision (so, why are they better than just "XX%"?)
> 
> That was a user request.  Originally it was just 
> whole numbers.  It's limited to just a hundredths
> place, but that seems to be a good tradeoff.  That
> offers 10,000 possible "locations" to be specified
> and in most cases that's fine.
> A user can always just write "5" if that's
> all the accuracy they need, too.

I guess it depends on how position manifests itself
in your application.  E.g., I can't imagine there 
being a (visible) cursor that moves along in 
increments of single "letters".  Rather, I would 
think you represent position as "somewhere on this
general screen of text".

I.e., your internal representation probably uses
the offset into the file to figure out where to
start displaying text (perhaps the "position"
corresponds to the top-leftmost character on the
screen -- assuming left-to-right writing order)?
But, if that position is in the middle of a *word*,
presumably you move forward/backward to ensure the
first *whole* word is displayed?

If that is the case, then my point is, there can 
be at least two valid "position codes" for each 
possible *displayable* position (grrr... language 
getting in the way, again, here  :< ).

E.g., *if* you use the "whole word" scheme that I
described, then, given the following text:
   "I awoke this joyous morning
    bright and starry eyed.
    I stepped up to my window
    and threw the shutter wide.
    Upon my sash perched a little bird
    singing with mounting cheer.
    And from the gayness of his note
    I knew that Spring was near."
the positions "2" and "3" will each result in
the screen being presented as:
    "awoke this joyous morning
    ..."
(note the missing "I ") while positions 4 through
9 will each cause:
    "this joyous morning
    ..."
to be displayed (assuming your algorithm advances
the position to the start of the next whole word)
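
[A minimal sketch of that "advance to the next whole word" rule, in
C.  The function name is hypothetical, positions are 1-based to match
the numbers above, and the text is assumed to be one byte per
character with plain whitespace between words.]

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Given a 1-based character position, return the 1-based position of
   the first whole word at or after it.  A position already on the
   first letter of a word is returned unchanged.  Returns len + 1 if
   no further word starts. */
size_t snap_to_word_start(const char *text, size_t pos)
{
    size_t len = strlen(text);
    assert(pos >= 1 && pos <= len);

    size_t i = pos - 1;                     /* 0-based index */

    /* Mid-word (previous char is not a space): skip the rest of it. */
    if (i > 0 && !isspace((unsigned char)text[i - 1])) {
        while (i < len && !isspace((unsigned char)text[i]))
            i++;
    }
    /* Skip whitespace up to the start of the next word. */
    while (i < len && isspace((unsigned char)text[i]))
        i++;

    return i + 1;
}
```

With the first line of the verse as input, positions 2 and 3 both
come back as 3 ("awoke") and positions 4 through 9 all come back as 9
("this"), i.e., several stored positions collapse onto one displayable
one.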

So, this tells you that you really have more
precision than you need (for a given text).
Alternatively, your given precision supports
a resolution applicable to files that are twice
as large!  :>

> >Since *you* initially reported the position in these terms,
> >you can also store the corresponding "character count" at
> >which that "percentage" (offset_t) was achieved.
> >
> >Of course, the first time you see a book, this is a simple
> >calculation:  0% bytes *or* characters!
> >
> >I recognize seeking to a percentage *byte* offset is
> >considerably easier than "walking" to that same
> >*character* offset.  But, you can do both:  spawn a 
> >task that starts counting characters while
> >you, meanwhile, "jump" to the "byte offset" and begin
> >formatting the page.  When the other task finishes, it
> >can reposition the "cursor" more accurately (I realize
> >this could result in a large shift, potentially.  But,
> >if you already have the "character offset" metric 
> >*stored*, the problem goes away).
> >
> >This allows a user a moment to look at the page and
> >refresh his memory as to why this place is (?) 
> >significant.  Since this API ("UPI"??) would be shared
> >with the bookmark feature -- i.e., not just the "last
> >position stored" -- it is possible that the user will
> >decide that this is NOT where he wanted to be.  So,
> >he returns to the bookmark menu and makes another
> >choice, etc. (you then have to decide the most efficient
> >way of redirecting that background task's execution in
> >light of this new goal).

Note that this is in line with my notion that some
"positions" aren't really valid or distinguishable from
others (i.e., if you force the screen to begin on a word
boundary)

> >[frankly, I think if you also store a "character offset"
> >associated with each "position indication", you don't need
> >this extra complexity to be able to give the user a
> >character oriented position indication.   <shrug> ]
> 
> Oooo... that might be a bit more complexity than is
> needed, I think.  I see your point about storing extra
> data to avoid all this extra work on the reader side of
> things... but that only works for predetermined/calculated
> information, such as bookmarks added when the book text
> is first converted into the native format.

Yes.  But, isn't that always the case?  I.e., if *you*
(your application) have stored the datum, then you know
exactly where you are (in terms of characters, words,
lines of text, etc. -- whatever you happen to have been
computing *while* you got to this point in the text!).
So, you can pick which of these things are worth saving.

Likewise, when the book is first converted/opened, the 
offset is "0" -- characters *and* bytes!

If, hereafter, you update "byte offset" and any other
related metrics consistently at the same time, then you
should be able to use either/all to reconstruct that
position at a later time.  I.e., wrap them in an
"advance_position()" member or thereabouts.

I suspect the cost of saving:
    offset_t position;
vs.
    struct {
      offset_t position;    /* byte offset into the file */
      ulong char_count;     /* characters preceding that offset */
    };
isn't going to "break the bank" in terms of resource
utilization  :>

[however, it poses another sinister problem -- you
are now storing two different representations of the
"same datum".  This invites problems when/if they
ever get out of sync!  :< ]
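
[One way to keep the two from drifting apart is to let exactly one
routine touch them.  A sketch in C, assuming UTF-8 text held in
memory; "struct book_pos" and its field names are made up, and size_t
stands in for the offset_t/ulong above.  It counts code points, not
what the user perceives as letters.]

```c
#include <assert.h>
#include <stddef.h>

/* Both views of "where am I", updated only by advance_position(). */
struct book_pos {
    size_t byte_offset;   /* offset into the encoded (UTF-8) text */
    size_t char_count;    /* code points that precede byte_offset */
};

/* Step forward by one code point in UTF-8 text of length len.
   Returns 1 on success, 0 if already at the end of the text. */
int advance_position(struct book_pos *p,
                     const unsigned char *text, size_t len)
{
    assert(p != NULL && text != NULL);

    if (p->byte_offset >= len)
        return 0;

    p->byte_offset++;                       /* consume the lead byte */

    /* Continuation bytes look like 10xxxxxx and belong to the same
       code point, so they move the byte offset but not the count. */
    while (p->byte_offset < len &&
           (text[p->byte_offset] & 0xC0) == 0x80)
        p->byte_offset++;

    p->char_count++;
    return 1;
}
```

Since nothing else writes either field, a saved {byte_offset,
char_count} pair is always self-consistent for the text it was
computed against.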

> What happens when the user searches for a given
> word.  Finding the data is quick, but then one must
> crawl forward for the proper character position.

That depends on how you implement your search algorithm.
If you are doing a pattern search, then you modify the
algorithm so that it updates the "position" datum as
it skips along through the data (I assume, here, that
you aren't using a naive "sequential" algorithm).

> Of course, I can see the document having a table
> of offsets/character positions so that crawling 
> can proceed from the nearest known location.

If you ignore the complications I mentioned at 
the start of this reply (i.e., diacriticals),
then you can modify the code that does your search
to also maintain a "position" datum.  I.e., if 
your algorithm (e.g., Boyer-Moore) says, "based
on the pattern being recognized and the text I
am comparing against, you can safely skip forward
6 characters since I *know* there won't be a match
between here and there", then you can advance
"position" by that count of "6" while you are 
also advancing the "W_char* pointer" by six.
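
[To make that concrete, here is a sketch in C of a Horspool-style
search (a simplified relative of Boyer-Moore, not the full algorithm)
over fixed-width 16-bit units.  All names are hypothetical.  The skip
table is indexed by the low byte of each unit to keep it small, which
can only make a skip shorter than necessary, never longer.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NOT_FOUND ((size_t)-1)

/* Return the character position of the first match of pat (m units)
   in text (n units), or NOT_FOUND. */
size_t find_units(const uint16_t *text, size_t n,
                  const uint16_t *pat, size_t m)
{
    assert(text != NULL && pat != NULL);
    if (m == 0)
        return 0;
    if (m > n)
        return NOT_FOUND;

    /* How far the window may slide, keyed on the unit under its
       last slot.  Default: the whole pattern length. */
    size_t shift[256];
    for (size_t i = 0; i < 256; i++)
        shift[i] = m;
    for (size_t i = 0; i + 1 < m; i++)
        shift[pat[i] & 0xFF] = m - 1 - i;

    const uint16_t *window = text;   /* the pointer being advanced */
    size_t position = 0;             /* character count, kept in step */

    while (position + m <= n) {
        size_t k = m;
        while (k > 0 && window[k - 1] == pat[k - 1])
            k--;
        if (k == 0)
            return position;

        size_t skip = shift[window[m - 1] & 0xFF];
        window += skip;              /* both move by the same amount */
        position += skip;
    }
    return NOT_FOUND;
}
```

The pointer and the "position" never disagree because every skip is
applied to both in the same breath.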

> Thankfully, most of that should be moot by
> using UTF-16.

<grin>  Well... if you decide UTF-16 is good enough,
then you can just use pointer arithmetic.

[page numbers...]
 
> >Agreed.  But, I suspect it gives users an easier 
> >way to relate to their position than your "more
> >precise" percentage indicator.
> >
> >For example, if I am 123 pages into a 240 page 
> >book, I know I am "about halfway".  Or, "less
> >than half remains".  Or, "Only about 120 pages
> >left", etc.  For those of us who grew up with
> >paper, I think this is more in tune with how 
> >we *think* of reading a book.  "I've only got
> >6 more pages to go" vs. "I've only got 14.57%
> >remaining".  The latter is meaningless to me
> >*except* as a pure "relative position indicator".
> >It doesn't reflect the size of the document so
> >I can't use it to gauge my investment, commitment,
> >etc.
> >
> >We have our own concept of what a "page" is.
> >Even if the document redefines it for us!
> 
> Hmmm... that's a good point, but you're also 
> right about the concept of a page changing.  
> And it's worse than that.  What exactly *is* a
> page in an electronic medium?  The book is just
> one long stream of text, after all.

I think, here, you have to be careful about terms.
I.e., if the document contains structural elements
that define its "pages", then *that* is what a 
page is!  E.g., if a journal article is converted 
to electronic form, the conversion process will
often preserve this "page information" since that
is how *other* references to this journal will
be expressed.

OTOH, there is the concept of a "screen" which 
reflects some dynamic concept *similar* to a 
"physical page" -- but *different*.  I think you
will have to come up with a taxonomy that makes
sense for your application and then rigidly stick
to that in your descriptions to users.  As long
as the user can relate to it in some intuitive
way, then I think he/she will embrace your
definition.

If, for example, you claim that a "fraggle" is
1047 characters, your users' eyes will glaze
over; "what the hell is *that* and why should
I care???"

OTOH, I think you can express position as
"screen M of N" -- and N can change if the
user reformats the document!  I think a user
would understand this since he can see the
difference in the "current display" vs. the
"previous display" and correlate that with the
changes in "N".


> Is it how much can fit on the device's display?
> But then what to do when the user alters the
> line spacing or font choice?

Exactly.  Or, "window size".

In this case, instead of:
  "You are on page 5 of 23"
the status line would proclaim
  "You are on page 6 of 30"

> Suddenly the definition of a page has changed.

I would argue that what has changed is *not* 
the "page size".  Rather, it is some application
specific metric (call it "screen number" instead 
of "page number"?)
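
[The "screen M of N" arithmetic, sketched in C.  This assumes a fixed
number of characters per screen, which is a simplification: real
word wrap makes every screen hold a slightly different amount.  The
struct and function names are hypothetical.]

```c
#include <assert.h>
#include <stddef.h>

struct screen_pos {
    size_t current;   /* 1-based screen number, the "M" */
    size_t total;     /* number of screens, the "N" */
};

/* Map a character offset to "screen M of N" for a given layout. */
struct screen_pos screen_of(size_t char_offset, size_t total_chars,
                            size_t chars_per_screen)
{
    assert(chars_per_screen > 0 && char_offset < total_chars);

    struct screen_pos s;
    s.current = char_offset / chars_per_screen + 1;
    /* Round up so a partly filled last screen still counts. */
    s.total = (total_chars + chars_per_screen - 1) / chars_per_screen;
    return s;
}
```

For a 23000 character text with the reader at offset 4500, a layout
of 1000 characters per screen gives "5 of 23"; shrink the layout to
767 per screen and the same offset reads "6 of 30".  The stored
offset never changed, only its presentation.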

For example, The Bible has a means of uniquely 
and unambiguously identifying "where you are" 
regardless of how wide the "columns" are in a
particular edition.  (someone can correct me if
I am wrong on this?)

So, *that* mechanism is used to convey position
information "portably" to other people... while
*you* might refer to "which *page*" you were on
in *your* particular copy that you are reading.

Other disciplines use other schemes to uniquely 
and unambiguously represent "positions".  E.g.,
technical specifications are generously salted 
with sections, subsections, subsubsections, etc.

> This makes page numbers useless for anything 
> other than a "you are here" message.  You 
> can't let a user use them to move to a 
> particular location in a book nor use them 
> as bookmark anchors because they can too easily
> change.

Sure you can!  They are simply different ways
of expressing the same information!  If I
say that I want to go to "page (screen) 23",
then that is in the context of the current
page (screen) size!  (again, I dislike having
to use the word "page" in this context -- see
below)

You can't just *store* that "23" as a definitive
position indicator *unless* you also store
the "screen size" associated with it.
(I am just being argumentative, here.  I don't
advocate using "screens" as the basis of
storing position data -- unless your application
enforces this concept in some absolute sense)


But, if the document has page information embedded
in it, I would argue this is worth conveying 
to the user.  For the same reason that *chapter*
numbers (if present) are worth presenting.

I think you have to view your application in 
much the same context as a *browser*.  E.g., 
you don't think of where you are in a scrolled
(web)page as "page M of N".  The "(web)page"
is whatever the author decided that "page"
should be -- even if it takes 40 screens to 
view it all!  The page itself is still 
addressable.  The *browser* is the only thing 
that knows about "which screenful of data" 
to display.

> It would require a lot of processing to move
> to a given page because the beginning offsets
> of pages are not constant.
> 
> Another problem with "pages" is the autoscroll
> feature.  Personally, I don't use this, but many users
> seem to like it a lot.  This is where the text slowly
> scrolls up the screen (or line by line, depending on
> how it's configured).  Where are you when you stop
> in between pages?  You could

That depends on the model you adopt for "position".
E.g., *when* are you "at midnight"?

> reposition to the start of the current page, but,
> as you mentioned above, this is something
> users are quite vocal about not liking.  :)
> 
> >E.g., if I read a technical journal with lots of multicolumn
> >fine print, my page size notion is much larger than when
> >reading a paperback novel.  Reading that same novel in a
> >hard cover edition gives me yet another notion of page size.
> >Yet, since I know how many pages there are in *this* particular
> >document, I can quickly form a gut feel for my progress through
> >the document.  "I read about a page per minute", etc.
> >
> >[N.B. For some types of documents, this might be a valuable
> >metric to compute and convey (in some form) to the user.  E.g.,
> >"expected time remaining"  <grin> ]
> 
> I can see offering it to users as a purely convenience
> feature.  You are on "page" 42 of 528.

Sorry.  Perhaps I wasn't clear.  :<  I meant:

"At this rate, you should be finished in 3 hours and 23 minutes"
or:
"You should be finished at 4:37PM"

I.e., some way of helping the user gauge the "commitment"
remaining for him/her to finish the text.
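
[The estimate itself is one line of arithmetic.  A sketch in C,
assuming you already track characters read and seconds elapsed in
the current session; the names are made up, and a real version would
want to smooth the rate and ignore idle time.]

```c
#include <assert.h>

/* Estimated seconds left, assuming the reader keeps the pace
   observed so far (chars_read characters in seconds_elapsed). */
unsigned long long seconds_remaining(unsigned long long chars_read,
                                     unsigned long long seconds_elapsed,
                                     unsigned long long chars_left)
{
    assert(chars_read > 0);
    return chars_left * seconds_elapsed / chars_read;
}

struct hm {
    unsigned long long hours;
    unsigned minutes;
};

/* Split a duration in seconds into whole hours and minutes. */
struct hm to_hours_minutes(unsigned long long seconds)
{
    struct hm t;
    t.hours = seconds / 3600;
    t.minutes = (unsigned)((seconds % 3600) / 60);
    return t;
}
```

E.g., 12000 characters read in 600 seconds with 243600 characters
left works out to 12180 seconds, which is the "3 hours and 23
minutes" above.  Feed it the distance to the next chapter bookmark
instead of the end of the file and you get the per-chapter figure.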

You could carry this to increasing levels of detail.

E.g., if the document contains structural information
(pages, chapters, sections, etc.) you could convey how
much time is estimated until the reader gets to the
end of the current chapter, end of the book, etc.

I mention this because, when reading, I often find myself
trying to budget how much *longer* I can afford to keep
up this activity (before I must move on to something
else "more pressing").  So, I will flip pages to see
where the next chapter begins -- in an attempt to
figure out how many more pages I would have to commit
to reading before I reached that point.  I am very
conscious of my reading speed in any given context so
I can quickly determine how much *time* this represents.

Based on that calculation, I might opt to "finish the
chapter" *or*, maybe just stop where I *am* -- because
there is too much remaining in the chapter and "this
spot is as good as any *other* spot to stop..."

Consider people who read at bedtime; this gives them
a way to decide if they want to stick with the book
a little while longer *or* just roll over and call
it a night...

> I worry, though, that offering this detail will result in
> many rejected feature requests for movement in the program
> via page numbers.

You could likewise argue against requests to move
"by chapter numbers", etc.

Decide what structural elements you want to support
and what the user would *expect* you to support.
Then, implement whatever portion of those "make sense".
E.g., I would be hard pressed to support navigation
by "paragraph number" -- unless paragraphs happened
to be a tagged element, etc.

> >Such a slider needs to have a "transmission" associated with it;
> >the user needs to be able to downshift to get more precise
> >control as well as upshift to get coarser, more rapid movement.
> >With each movement, if you could (ideally) repaint the screen
> >so he/she could "get their (relative) bearings"...
> >
> >[You could use speed of gesture to determine which "gear"
> >you are in]
> 
> Yes, that's a good idea.  Designing for a stylus
> isn't too hard (it's very mouselike).  But these new touch 
> interfaces will require new ideas and a lot of fine tuning.

Yes.  I've been moving towards a gestural interface, now.

(i.e., all *relative* motions devoid of absolute position
context)

> Especially problematic as I have no G1 phone at the moment
> and I'll definitely need one so I can "feel" how a progress
> slider operates.  I've got the dev environment and emulator 
> up and running, but dragging with the mouse and with a
> finger are not comparable.

Agreed.  You also need to look at how the "gestures" are
chosen/assigned.  E.g., with a fingertip, "up and down"
are easier than "sideways".

OTOH, some gestures tend to be subconsciously associated
with certain types of actions in an application.  E.g.,
"flicking" (flipping) a page forward/backwards may be
more intuitive than moving a finger downwards or upwards.
 
> >I would argue for a character based metric instead of
> >"byte offset" approach.  I (personally) think the added
> >cost to the developer is far outweighed by the intuitive
> >nature of that metric.  I believe you are also far less
> >likely to "surprise" (Principle of Least Surprise) users
> >if they *see* 10 characters on the screen, the cursor on
> >the 3rd character and the position reported as "30%"
> >REGARDLESS OF THE PARTICULAR CHARACTERS PRECEDING AND 
> >FOLLOWING THE CURSOR.
> 
> Yes, I'm going to have to put a lot of thought into the
> tradeoff between accuracy and speed.  This wouldn't have
> been an issue on Palm devices since they're too slow.
> Newer devices are fast enough that it's possible to
> calculate many of these values without the user noticing
> (if you do it properly, that is).

I always start with (what I *think*) the user's expectations.
I don't want to have to introduce a new paradigm and then
*convince* the user to adopt this.  Unless there is something
*revolutionary* about this "new way of thinking", it's just
not worth (IMO) the effort trying to *justify* it to users
(i.e. *customers*) who are expecting something else.

For example, when baking, I have a very pronounced
preference for the units of measurement used in the
Rx's.  Butter, for me, is expressed in units of *pounds*,
not *cups* (nor T, t, etc.)!  OTOH, flour is expressed
in cups, not pounds.  Exceptions to these "rules" exist
but only when justified.

E.g., a bread Rx may call for "5 pounds" of flour; this
makes sense (vs. the "equivalent" 14C) because flour
is often purchased in 5 pound sacks.  Likewise, a cake Rx
may call for "1.5T" of butter as "0.046875 pound" (3/64)
would be a nightmare to try to comprehend/measure!

However, my expectations/needs differ from those of
others.  This is especially true across cultures.

For example, I once gave someone a recipe that called
for "5 squares" of chocolate.  Apparently, in France,
(my friend was French) there is no concept of a "square
of chocolate" (!).  Had I, instead, said "5 oz" of
chocolate, she could have come up with an appropriate
conversion and then scaled accordingly.

As a result, I have had to rethink how I record my recipes
(as well as deciding which ones are not intended to be
"portable"  :> )

I met with poetic justice in exactly this context
recently when I came across a recipe calling for "1 packet"
of vanilla.  Cripes, what's a *packet*?  Twelve octets???
:-/

> >[I'm sure you've thought about this.  Just consider that
> >you can't control the material that is being *read*.  Are
> >you willing to penalize a user who happens to read lots
> >of technical documentation with fancy mathematical symbols
> >(which don't fit neatly in single bytes) just for the
> >sake of implementation ease?  <shrug>]
> 
> The hell I can't control it!  Oh, wait... I
> can't...  :(

<grin>  Well, you actually can *influence* it...  E.g.,
if you are targeting your software to the reading of
novels, you can argue that support for certain other
features doesn't make sense in that context.  OTOH,
if you later want to repurpose the software for a
more universal application, you may kick yourself for
shortcuts taken now, etc.

<shrug>
 
> >Given some "core" position representation, I think you should
> >also take advantage of whatever *other* structural information
> >is present in the document to convey to the user his/her
> >position *relative* to this framework.
> 
> This is another tradeoff, though this time it is over how
> much time I want to invest in the desktop "conversion"
> program.  The format will need to support such abilities,
> and the input must be parsed to extract this information.
> 
> The old setup was much simpler...  nearly all input was
> plain text books,  often from Project Gutenberg.  The
> text could be scanned via regex to generate bookmarks
> (from Chapter/Section headings, or whatever). There is
> a feature currently that displays the title of the last
> bookmark you passed.

Ah, but do you store this information when you store the
"position"?  Otherwise, you have to "look backwards" to
figure it out at run time.

My point being, you have (?) augmented your "position"
datum with this "most recent chapter title" string
just like I was advocating you augment it with the
"actual number of characters" metric.

> So, if you had bookmarks on all major structural elements,
> the reader would seem to know your current heading.  It
> seems like a decent tradeoff.

Ah, I see.  The bookmarks are presented in some sort
of "header" in the document?  So, you can just walk
through an (ordered) set of bookmarks looking for the
one whose "position offset" is largest without exceeding
the "current position".
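
[That walk is a textbook binary search if the bookmarks are kept
sorted by offset.  A sketch in C; "struct bookmark" and its fields
are my guess at the shape of the data, not Weasel's actual format.]

```c
#include <assert.h>
#include <stddef.h>

struct bookmark {
    size_t offset;        /* position of the heading in the text */
    const char *title;    /* e.g. a chapter title */
};

/* Return the last bookmark whose offset does not exceed pos, or
   NULL if pos lies before the first one.  marks[] must be sorted
   by ascending offset. */
const struct bookmark *current_heading(const struct bookmark *marks,
                                       size_t count, size_t pos)
{
    assert(marks != NULL || count == 0);

    size_t lo = 0, hi = count;
    /* On exit, lo is the number of bookmarks with offset <= pos. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (marks[mid].offset <= pos)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? NULL : &marks[lo - 1];
}
```

With only a few dozen bookmarks a linear walk would do just as well;
the point is only that nothing has to be stored alongside the
position to recover the current heading.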

(I had assumed these were "tags" embedded in the text.
I.e., that you would have to *scan* the encoded document
to find where the headings, page numbers, etc. were
stored -- I guess this is what your conversion program
does...?)

> The other issue is how much structure does it make sense to
> store?  A web browser, for example, can be called upon to
> display nearly anything and must be generic enough to do
> so.  The *target* here is reading books.  Given that
> the vast majority of documents read will be books (and

Sorry, I'll try to add/impose more precision on your
description.  By "books", I assume you mean "novels".
More colloquially, "stories".

(By contrast, "books" could refer to technical manuals,
encyclopedias, school books, etc.)

> those typically have simple or minimal internal structure)
> how much effort should be put into
> supporting these more advanced but less used features?

I would argue that if you supported chapters (where present
in the original document) and page numbers (again, where
present in the original), you've covered most "books" in
this application domain.

My comments were intended in a more generic context.
I.e., reading a text book would probably be hampered
if you couldn't find "section 4.7", etc.

> >I.e., if the creator of the document went to the trouble of
> >including this structural information, assume there is some
> >significance to it and try to use it to give the user a
> >framework in which to judge his "true position".
> 
> Fortunately, most ebook formats currently in use don't
> have a whole lot of structure elements to them.  I aim
> to support more formats this time than what the Palm
> version does, but I'm also limited by the desire not to
> spend many hours reverse engineering closed formats.

Yup.  This can be fun -- for a *little* while.  Thereafter,
it's just plain tedious!  :<  There are people who spend
their lives doing this sort of thing  :-/  (talk about a
waste...)

> The primary reason Weasel currently supports PalmDoc files
> is because another GPL'd program already did the
> hard work of divining the format's dark secrets.

Understood.  Are there any other "back door" approaches you
can use that would make this easier?  E.g., if a particular
device ("reader") has the ability to export text in plain
ASCII, etc. and you just suck that in and massage it...

> >Since you have a GUI available, you could opt for some
> >abstract representation of the document (i.e., one that
> >fits on a single "screen") that outlines the structure 
> >and shows the user's position -- along with those of any
> >additional bookmarks? -- using some legend.
> >
> >This could also serve as a means for letting the user
> >position himself in the document -- give him some sort
> >of "zoom" control so he can see regions in greater detail
> >(this allows the structure of the document to be as
> >fine-grained as the author intended without compromising
> >the presentation based on the characteristics of the 
> >GUI hardware)
> 
> Again, is there a need to go beyond bookmarks for this? 

For your target market, probably not.  Again, my comments
were in the context of a "universal" book reading application
(in which you would want to see all of this structure)

> They have names, record position, and are anchored at 
> various places in the text.  When displayed in order
> you can get this overview of the document, too.  It all
> depends on how well the ebook was constructed.  I've
> seen a great many PalmDoc files (and others) that just
> plain didn't bother to add bookmarks.  Then it becomes
> much more necessary to rely on the reader program's
> facilities for moving around the document.  Jumping to
> percentages/pages/etc. or searching for some string and
> jumping to that location.

Understood.
 
> Thanks again for the feedback!

Good luck!  Feel free to drop me a line offlist if the
mood strikes you...

--don


      



