cxpos is an implementation of this paper:
=========================================
Copyright 2004, D.Sofyan & Adrian Hafizh,
private property of PT SOFTINDO, Jakarta.
All rights reserved.
=========================================


  Title: An analysis of Bound-Range in indexed-character-based string matching scan
  Version: 1.0.0.1-beta
  Authors: D.Sofyan & Adrian Hafizh

  The most important thing when doing a substring scan is to know
  WHEN the character sequences in the substring is no longer
  matched, as early as possible, and if possible, to detect WHAT
  sequences does not worth to scan and could be skipped, as much as
  possible, to minimalize useless, discarded matching tries.

  Based on this postulate, we would analyze the new algorithm of
  character based scanning to quickly locate a position of pattern
  or smaller substring against the larger one, much more efficiently
  compared to that of conventional scan.

  On this paper we propose an improved, sub-optimalization of
  character-based pattern match technique.
  .
  ..
  ...

  First, we have to build an INDEX table of substring to be matched
  against. This table should be initialized in such a way that the
  ordinal position of particular character in the INDEX table
  contains the index/position of that character in the substring,
  or in other words, every characters in the substring, have their
  index/position in the substring recorded in their respective
  ordinal position in the INDEX table. For any other characters that
  are not in the substrings, their ordinal positions in the INDEX
  table should be filled with a value greater than the last position
  of the substring (an OUT-RANGE index-value). From now on, for
  simplicity shake, the ordinal position of particular character's
  value in the INDEX table will be called as only the INDEX-VALUE.

  Conveniently, we use 0-based position for the index/position.

  When a particular character appears more than once in the
  substring, then the ordinal position of that character in the
  INDEX table must contain the highest value of the index/position
  of that particular character in the substring (the last one). On
  further descriptions we will call this kind of characters as TWINS.

  After filling the INDEX table with the proper values, we have to know
  the index-values range by getting the LOWEST and the HIGHEST
  index-value of the characters sequence in the substring (also in
  the INDEX table); the HIGHEST index-value is already indicated by
  the last position of the substring, whereas the LOWEST index-value
  could be gotten by scanning either the substring or the INDEX table,
  From now on, they would simply be called them as the LOWEST and the
  HIGHEST respectively.

  It is also worth to know if there are not any duplicated characters
  (TWINS) in the substring. The search algorithm will be much stright
  and simpler, for example, any out-of-range character catched during
  the skim will result on to rejection of the whole skimmed-substring,
  because the LOWEST would have never been reached anyway upto until
  the last fetch.

// If they are not any duplicated characters in the substring (TWINS):
// Once the INDEX table has been built, we no longer need the
// substring anymore, all of the informations we need are already
// in the INDEX table.

  In this document, we analyze the comprehensive state in general,
  which also sorrounding/including this unique character sequences,
  should the performance is considered to be much critical, we might
  have to do a different branch for this case.

  The other advantages of using the INDEX table is that no-case
  comparison will no longer need different or additional instructions,
  as long as the comparison is based on their index-values, not between
  the characters by themselves, because neither lower or uppercase is
  treated specially, only the index/position of particular character
  in the substring that makes different. .

  On further searching, we should avoid doing comparison of substring
  with string to be matched against by their character per characters
  per-se, instead, we mainly using this INDEX table to do the
  comparison. The characters which contained/existed in the substring
  will have its index-value equal with its position in the substring,
  any character whose index-values higher than the length of substring
  (OUT-RANGE) does not need any further processing, or in the other
  means, only the characters in the substring will be taken into account.

  Second, when doing a skim, we do it in the reversed order, to
  take advantages of the length of the substring to be matched
  against to be accounted in. This reversed way will also lead to
  non-measurable advantages which gain effectiveness when scanning
  a recognizable pattern (a word) against sequences of another
  recognizable pattern (such a document text), because one word has
  a relatively low possibility to match another pattern consisted
  by to half words (in the same length), thus the non-match will
  detected earlier.

  On this algorithm, this reversed way is also aimed at the computation
  on how much characters sequences could be skipped at comparison's fail
  when skimming.

  Finally, mainly by using the index-values we will seek the string to
  be matched against, try to locate the possible matched substring, and
  if possible, skip any sequence of characters that will never match.

  The search begin with comparing the fetched character from the
  string with the first character of the substring (we called further
  the position of the string at this point as ORIGIN), only if they
  are equal then we begin the substring skim cycle, otherwise we
  simply skip to the next character.

// If they are not any duplicated characters (TWINS) in the substring:
// The index-value of 0 has the special position here, because it is
// should be used as initial substring skim cycle. The index-value of 0
// means that the character is the first character in the substring,
// which we will seek it first on the string, if the character fetched
// in the string has the index-value other than 0, we simply skip it
// to fetch the next character, only if its index-value is 0 then we
// begin the substring skim cycle.

  The substring SKIM CYCLE is the process of comparing the CHUNK of
  the string (portion of the string with the length of the substring,
  --will be called as only LENGTH on further descriptions) against
  the substring, on a indexed-character-based comparison. When the
  comparison fails, whe should check WHEN is that happens (denoted
  by the position COUNTER) and WHAT is the INDEX-VALUE (of the
  character fetched from the chunk, at the offset pointed by the
  current counter position). Keep the terminologies in mind, for
  those will be carried on further descriptions.

  On any conditions, the process will keep going on while the
  index-value is equal with the counter, until end-of-loop, that is
  the counter reaches its lowest value, which also means that the
  substring has found. Only when the index-value is different from
  the counter then we have to suspend the process, and do several
  investigation to determine the exact state.

  On any conditions, if the index-value is higher than the HIGHEST
  then the cycle will be ended and the original fetch position could
  be skipped to that point. This conditions indicates that the
  character in question is out of range, the chunk could be safely
  cut to the position after this character.

  The skimming cycle started from the HIGHEST value to the LOWEST
  value (counted-down), this counter should be compared against the
  index-value of current fetched character (original fetched position
  with the offset of counter). There are no physical accesses to the
  substring at this step, the comparisons are merely of index-values
  against the counter.

// If they are not any duplicated characters (TWINS) in the substring:
// We never do physical comparison of the characters at all
// We only need a single comparison stage, any index-value
// different from the counter will lead to suspended skim
// If the index-value is lower than the counter we should shift
// forward the original fetch position by (counter - index-value) away
// if the index-value higher than the counter, then we could
// discard the whole string and skip forward the original fetch
// position to the length of the substring

  On comparison fails, and the index-value is between the LOWEST and
  the HIGHEST, then we have to do the next step, that is manually
  comparing the character from chunk with the substring, both at the
  same counter position, if they are equal then we continue the
  suspended first-step. For the comparison to be carried on further
  processing, then is not the characters itself that we had the
  comparison against to, but rather indirectly by their index-values.

  Note that if both of the characters are equal (by comparing their
  index-values), there is an indication of TWINS, notice also that
  their values will have to be higher than the counter.

  If the comparison fails at this stage, then one way or another,
  the skim have to be ended. But beforewise, we might do another
  check to determine how much save we could get by skipping useless
  further tries.

  At this stage, we virtually try to match the character of
  substring[?] where ? is the index-value of the chunk[COUNTER].
  Any index-values between the LOWEST and COUNTER are considered
  to be OUT-RANGE. Note that the values below the LOWEST will never
  exist, so what we have to consider are only the COUNTER value, and
  (supposed to be) the appearance of the LOWEST, and the values
  between them, if any. The convergency of those lower and
  upper-bound makes the RANGE-bound in effect as described later.

  The RANGE-bound is a virtual range between the LOWEST and the
  (counted-down) HIGHEST, consisted by all of acceptable index-
  values. At first, it is composed equally with the substring. The
  RANGE-bound length is also in correlation with the LENGTH of the
  substring (or the chunk), it supposed to be equal with the
  counter itself, in fact it is, if there are not any TWINS. Due to
  this TWINS problem, the length of RANGE-bound is practically less
  than LENGTH, because of the length of several characters is counted
  as only 1. What we do know is only the upper-bound of the RANGE-bound,
  indicated by the counter (and the expected lower-bound by the LOWEST).

  while skimming, the RANGE-bound length might be virtually decreased
  towards the lower bound according to the counter, yet it physically
  decreases only if the character fetched IS NOT a TWIN. Wen the fetched
  character IS a TWIN, the RANGE-bound decrements virtually mapped to
  a corresponding TWIN value, this map is always an index-value higher
  than the counter NEVER below them, (because of the increasing rules
  of the index-values when they're been built), in consequence of that,
  the lower-bound, would be the last index-value reached.

  The expected to be found: the LOWEST, is the index-value of the
  least and last (the first-reversed) particular character class that
  expected to be found. She is the lower-bound of the range, also
  implies as the assurance of the RANGE-bound length limit, though
  we don't know exactly WHERE she is, neither WHAT of her value, we
  just expect (or more precisely: assumed) that she might have been
  there at one position later, with one particular value. In opposite
  to the upper-bound, that is the counter, we do know exactly
  where and what the value is.

  In conclusion, If the lower-bound has never been reached, then we
  ought to be sure that the length of RANGE-bound is keep intact with
  the length of the substring. In contrast, once the LOWEST has been
  found (reached), then we might not be confident about the integrity
  of (the length of) the RANGE-bound anymore, in fact, the RANGE-bound
  is no longer exists.

  (It's paradoxial, isn't? We are sure about something based on
  what we aren't, yet we aren't when conversely we are).


  Implementation:

  Since any prior index-values higher than the counter have been
  previously mapped to their respective characters, the index-value
  may not be higher than the counter (as noted above, the index-
  value of TWINS, of course would be higher than the counter, but
  consequently, both of the index-values will be once ever equal
  anyway), if they did, then we could safely truncate the original
  fetch position, right after that out-of-range character. Thus
  basically, any OUT-RANGE index-value will result in rejection of
  some portion of the chunk, or precisely, truncation of the chunk
  up to the position of the OUT-RANGE index-value.

  (Please bear in mind that when the LOWEST has been reached, then
  the RANGE-bound is not in effect anymore).

  If an unexpected index-value appears while the LOWEST has not been
  reached yet, then the chunk could be discarded, we do know that
  any unexpected index-value (any index-values not in the RANGE)
  will violate the RANGE-bound. Since we have a confidence that the
  length of RANGE-bound is keep intact with the length of substring
  while the LOWEST has not been reached, then any appearance of
  OUT-RANGE index-values would violate the length-bound. And since
  the LOWEST whose characters implies as the firstly characters at
  the sequence are has not been appears yet, we might discard all
  the characters sequence that have been passed the prior test from
  beginning of the skim cycle, whereas the remaining, untested
  characters sequence, might also discarded as well because its
  must be shorter than LENGTH as intended by the length of the
  substring. They are all bring us the conclusion that the whole
  chunk could be safely discarded.

  addition:

  If the LOWEST has NEVER been reached then the whole chunk could
  also be discarded, the fetch position could be skipped by the
  length of the substring.

  If the LOWEST has EVER been reached then we should put the chunk's
  character at her ESTIMATED proper place, in other words, shift
  accordingly the original fetch position by length indicated by
  the difference between the counter and the index-value. (As would
  be described later, the shift would be increased by +1). Note that
  we could not truncate the chunk because there could be several
  characters in low order counter which might be matched the index-
  value of TWINS.

  Note: Further checks would be taken place ONLY for inequality.

  if the LOWEST has not been reached yet, it means that the acceptable
  index-value must be fit between LOWEST and counter (an exception
  of this is for TWINS which should have been passed the earlier test).

  if the LOWEST has been reached, it means that the counter position
  (value) is below the lowest posibble index-value, its impossible
  in this position that any lower index-values (compared with the
  counter value) will be found, is not worth to test, it must be
  higher than the counter value, and also, the index-value must be
  equal with the LOWEST (since any TWINS should have been catched
  in the prior comparison test, and so did with the out-range
  characters, --in fact this state would have never been reached in
  the last case), and finally it must be the SECOND LOWEST's
  character found (since the first-one, would have been already
  passed the previous test).

  Nevertheless, at this state we could not discard the chunk
  neither truncate it, since it is possible for all characters
  (any index-values) betwen the LOWEST and HIGHEST might appear,
  (although in this state they must be failed the comparison test).

  Those will bring us a conclusion to shift the original fetch
  position accordingly so the character in question is virtually
  placed BEFORE the LOWEST position (recall that the character is
  the SECOND LOWEST's character found), the shift distance it is
  equal with the distance between counter and the LOWEST plus one.
  It fairly possible that some distance exists between this character
  position with the first LOWEST character position found, but we do
  not have any information about how much of that, the best that we
  could do is to shift the original fetch position as stated above.
  (P = P + LOWEST - COUNTER +1).


  SUMMARY:

  There's always a trade between efficiency and completeness.

  On comparison fails, we could simply fetch the next character
  instead of doing any further test and processing which may or
  may not give any advantages for searching.

  Below are the summary of the steps and tests that could be taken,
  from a simple to a comprehensive ones.

  1. If the comparison fails (the index-value is different from the
  counter) we could simply step one forward the original fetch
  position.

  2. If the comparison fails, we could do a test to check whether the
  index-value is OUT-RANGE (higher than the HIGHEST) or not, and if it
  did, then we could discard the chunk and skip forward that out-range
  character), else we step the original fetch position one forward.
  (higher than HIGHEST -> discard)

  3. If the comparison fails, the index-value higher than the counter,
  then we should get the LOWEST and then test whether she has been
  reached or not by comparing her with the counter, if she was greater
  than the counter, it means that she has been reached, and vice versa.
  If she has NOT been reached yet, then we could discard the chunk,
  else we step the original fetch position one forward.
  (index-value > counter > LOWEST -> discard)

  4. If the comparison fails, we might also do another test, by first
  getting the index-value2 of the character in the substring at the
  counter position, if only both of the index-values are equal then
  the skim continued, else, if the chunk's index-value is out of
  RANGE-bound (higher than the counter), then we could discard the
  chunk and skip the original fetch to position by the LENGTH away,
  else we step the original fetch position one forward.

  5. If the comparison fails, the index-value2 we've got from the
  substring's character at position of the counter is different
  from the chunk's index-value, then we might do a final test, that
  is checking whether the LOWEST has beeen reached by comparing the
  counter with the LOWEST. If the values are different (and it will
  --as described above, the LOWEST's value MUST be higher than the
  counter, and also it will be EQUAL with the chunk's index-value),
  then we should positioned the chunk's character at her ESTIMATED
  proper place, ie. shift the ORIGIN by the difference between the
  counter and the (assumed) index-value plus 1 additional offset.

  Note that the last step is nothing more but an extension of the
  previous one, there is no additional state, neither additional
  burden to the program except for computing the skippable distance.
  Also in practice, we should not wasted any effort to get the LOWEST
  value, for she only exists in concept.


  Ch = SubStr[1]
  P = address(S)
  tail = P + length(S) - length(SubStr) -1
  @Loop:
    inc(P)
    if P > tail then goto @NOTFOUND
    if S[P] <> Ch goto @Loop

  Counter = length(SubStr)
  @SkimLoop:
    dec(Counter)
    if Counter = 0 then goto @FOUND
    I = index-value of S[P]
    if I == Counter then goto @SkimLoop
    (I could be any value excep Counter)

    METHOD1:
    if S[P + Counter] = SubStr[Counter] then goto @SkimLoop
    else inc(P, Counter +1) (equal with @Shift1 below)
    (
     Here we assumed that the LOWEST has (always) been reached,
     then the LOWEST MUST be equal with Counter.
     It is no harm to do a thing like that, since if the LOWEST has
     been reached we could either DISCARD or SHIFT the chunk.
       I > Counter -> DISCARD
       I < Counter -> SHIFT
     The beauty of this simplification is that the Counter
     value will always match the distance that could be skipped.
     On the other hand, with this treatment we might only
     shifting the chunk, when it could be discarded at all.
    )


    METHOD2:
    if I > Counter then begin
      V = Index-value of SubStr[Counter]
      if I == V then goto @SkimLoop
      if Counter > LOWEST then goto @Discard (P = P + HIGEHST +1)
      else begin
        if I > HIGHEST then @Truncate (P = P + Counter)
        else @Shift (by delta value: P = P + Counter - I)
      end
    end
    else (I must be < Counter)
      goto @Shift1 (P = P + Counter +1)


  Jakarta, 2004.10.11
  by: D.Sofyan & Adrian Hafizh