cxpos is an implementation of this paper: ========================================= Copyright 2004, D.Sofyan & Adrian Hafizh, private property of PT SOFTINDO, Jakarta. All rights reserved. ========================================= Title: An analysis of Bound-Range in indexed-character-based string matching scan Version: 1.0.0.1-beta Authors: D.Sofyan & Adrian Hafizh The most important thing when doing a substring scan is to know WHEN the character sequences in the substring is no longer matched, as early as possible, and if possible, to detect WHAT sequences does not worth to scan and could be skipped, as much as possible, to minimalize useless, discarded matching tries. Based on this postulate, we would analyze the new algorithm of character based scanning to quickly locate a position of pattern or smaller substring against the larger one, much more efficiently compared to that of conventional scan. On this paper we propose an improved, sub-optimalization of character-based pattern match technique. . .. ... First, we have to build an INDEX table of substring to be matched against. This table should be initialized in such a way that the ordinal position of particular character in the INDEX table contains the index/position of that character in the substring, or in other words, every characters in the substring, have their index/position in the substring recorded in their respective ordinal position in the INDEX table. For any other characters that are not in the substrings, their ordinal positions in the INDEX table should be filled with a value greater than the last position of the substring (an OUT-RANGE index-value). From now on, for simplicity shake, the ordinal position of particular character's value in the INDEX table will be called as only the INDEX-VALUE. Conveniently, we use 0-based position for the index/position. When a particular character appears more than once in the substring, then the ordinal position of that character in the INDEX table must contain the highest value of the index/position of that particular character in the substring (the last one). On further descriptions we will call this kind of characters as TWINS. After filling the INDEX table with the proper values, we have to know the index-values range by getting the LOWEST and the HIGHEST index-value of the characters sequence in the substring (also in the INDEX table); the HIGHEST index-value is already indicated by the last position of the substring, whereas the LOWEST index-value could be gotten by scanning either the substring or the INDEX table, From now on, they would simply be called them as the LOWEST and the HIGHEST respectively. It is also worth to know if there are not any duplicated characters (TWINS) in the substring. The search algorithm will be much stright and simpler, for example, any out-of-range character catched during the skim will result on to rejection of the whole skimmed-substring, because the LOWEST would have never been reached anyway upto until the last fetch. // If they are not any duplicated characters in the substring (TWINS): // Once the INDEX table has been built, we no longer need the // substring anymore, all of the informations we need are already // in the INDEX table. In this document, we analyze the comprehensive state in general, which also sorrounding/including this unique character sequences, should the performance is considered to be much critical, we might have to do a different branch for this case. The other advantages of using the INDEX table is that no-case comparison will no longer need different or additional instructions, as long as the comparison is based on their index-values, not between the characters by themselves, because neither lower or uppercase is treated specially, only the index/position of particular character in the substring that makes different. . On further searching, we should avoid doing comparison of substring with string to be matched against by their character per characters per-se, instead, we mainly using this INDEX table to do the comparison. The characters which contained/existed in the substring will have its index-value equal with its position in the substring, any character whose index-values higher than the length of substring (OUT-RANGE) does not need any further processing, or in the other means, only the characters in the substring will be taken into account. Second, when doing a skim, we do it in the reversed order, to take advantages of the length of the substring to be matched against to be accounted in. This reversed way will also lead to non-measurable advantages which gain effectiveness when scanning a recognizable pattern (a word) against sequences of another recognizable pattern (such a document text), because one word has a relatively low possibility to match another pattern consisted by to half words (in the same length), thus the non-match will detected earlier. On this algorithm, this reversed way is also aimed at the computation on how much characters sequences could be skipped at comparison's fail when skimming. Finally, mainly by using the index-values we will seek the string to be matched against, try to locate the possible matched substring, and if possible, skip any sequence of characters that will never match. The search begin with comparing the fetched character from the string with the first character of the substring (we called further the position of the string at this point as ORIGIN), only if they are equal then we begin the substring skim cycle, otherwise we simply skip to the next character. // If they are not any duplicated characters (TWINS) in the substring: // The index-value of 0 has the special position here, because it is // should be used as initial substring skim cycle. The index-value of 0 // means that the character is the first character in the substring, // which we will seek it first on the string, if the character fetched // in the string has the index-value other than 0, we simply skip it // to fetch the next character, only if its index-value is 0 then we // begin the substring skim cycle. The substring SKIM CYCLE is the process of comparing the CHUNK of the string (portion of the string with the length of the substring, --will be called as only LENGTH on further descriptions) against the substring, on a indexed-character-based comparison. When the comparison fails, whe should check WHEN is that happens (denoted by the position COUNTER) and WHAT is the INDEX-VALUE (of the character fetched from the chunk, at the offset pointed by the current counter position). Keep the terminologies in mind, for those will be carried on further descriptions. On any conditions, the process will keep going on while the index-value is equal with the counter, until end-of-loop, that is the counter reaches its lowest value, which also means that the substring has found. Only when the index-value is different from the counter then we have to suspend the process, and do several investigation to determine the exact state. On any conditions, if the index-value is higher than the HIGHEST then the cycle will be ended and the original fetch position could be skipped to that point. This conditions indicates that the character in question is out of range, the chunk could be safely cut to the position after this character. The skimming cycle started from the HIGHEST value to the LOWEST value (counted-down), this counter should be compared against the index-value of current fetched character (original fetched position with the offset of counter). There are no physical accesses to the substring at this step, the comparisons are merely of index-values against the counter. // If they are not any duplicated characters (TWINS) in the substring: // We never do physical comparison of the characters at all // We only need a single comparison stage, any index-value // different from the counter will lead to suspended skim // If the index-value is lower than the counter we should shift // forward the original fetch position by (counter - index-value) away // if the index-value higher than the counter, then we could // discard the whole string and skip forward the original fetch // position to the length of the substring On comparison fails, and the index-value is between the LOWEST and the HIGHEST, then we have to do the next step, that is manually comparing the character from chunk with the substring, both at the same counter position, if they are equal then we continue the suspended first-step. For the comparison to be carried on further processing, then is not the characters itself that we had the comparison against to, but rather indirectly by their index-values. Note that if both of the characters are equal (by comparing their index-values), there is an indication of TWINS, notice also that their values will have to be higher than the counter. If the comparison fails at this stage, then one way or another, the skim have to be ended. But beforewise, we might do another check to determine how much save we could get by skipping useless further tries. At this stage, we virtually try to match the character of substring[?] where ? is the index-value of the chunk[COUNTER]. Any index-values between the LOWEST and COUNTER are considered to be OUT-RANGE. Note that the values below the LOWEST will never exist, so what we have to consider are only the COUNTER value, and (supposed to be) the appearance of the LOWEST, and the values between them, if any. The convergency of those lower and upper-bound makes the RANGE-bound in effect as described later. The RANGE-bound is a virtual range between the LOWEST and the (counted-down) HIGHEST, consisted by all of acceptable index- values. At first, it is composed equally with the substring. The RANGE-bound length is also in correlation with the LENGTH of the substring (or the chunk), it supposed to be equal with the counter itself, in fact it is, if there are not any TWINS. Due to this TWINS problem, the length of RANGE-bound is practically less than LENGTH, because of the length of several characters is counted as only 1. What we do know is only the upper-bound of the RANGE-bound, indicated by the counter (and the expected lower-bound by the LOWEST). while skimming, the RANGE-bound length might be virtually decreased towards the lower bound according to the counter, yet it physically decreases only if the character fetched IS NOT a TWIN. Wen the fetched character IS a TWIN, the RANGE-bound decrements virtually mapped to a corresponding TWIN value, this map is always an index-value higher than the counter NEVER below them, (because of the increasing rules of the index-values when they're been built), in consequence of that, the lower-bound, would be the last index-value reached. The expected to be found: the LOWEST, is the index-value of the least and last (the first-reversed) particular character class that expected to be found. She is the lower-bound of the range, also implies as the assurance of the RANGE-bound length limit, though we don't know exactly WHERE she is, neither WHAT of her value, we just expect (or more precisely: assumed) that she might have been there at one position later, with one particular value. In opposite to the upper-bound, that is the counter, we do know exactly where and what the value is. In conclusion, If the lower-bound has never been reached, then we ought to be sure that the length of RANGE-bound is keep intact with the length of the substring. In contrast, once the LOWEST has been found (reached), then we might not be confident about the integrity of (the length of) the RANGE-bound anymore, in fact, the RANGE-bound is no longer exists. (It's paradoxial, isn't? We are sure about something based on what we aren't, yet we aren't when conversely we are). Implementation: Since any prior index-values higher than the counter have been previously mapped to their respective characters, the index-value may not be higher than the counter (as noted above, the index- value of TWINS, of course would be higher than the counter, but consequently, both of the index-values will be once ever equal anyway), if they did, then we could safely truncate the original fetch position, right after that out-of-range character. Thus basically, any OUT-RANGE index-value will result in rejection of some portion of the chunk, or precisely, truncation of the chunk up to the position of the OUT-RANGE index-value. (Please bear in mind that when the LOWEST has been reached, then the RANGE-bound is not in effect anymore). If an unexpected index-value appears while the LOWEST has not been reached yet, then the chunk could be discarded, we do know that any unexpected index-value (any index-values not in the RANGE) will violate the RANGE-bound. Since we have a confidence that the length of RANGE-bound is keep intact with the length of substring while the LOWEST has not been reached, then any appearance of OUT-RANGE index-values would violate the length-bound. And since the LOWEST whose characters implies as the firstly characters at the sequence are has not been appears yet, we might discard all the characters sequence that have been passed the prior test from beginning of the skim cycle, whereas the remaining, untested characters sequence, might also discarded as well because its must be shorter than LENGTH as intended by the length of the substring. They are all bring us the conclusion that the whole chunk could be safely discarded. addition: If the LOWEST has NEVER been reached then the whole chunk could also be discarded, the fetch position could be skipped by the length of the substring. If the LOWEST has EVER been reached then we should put the chunk's character at her ESTIMATED proper place, in other words, shift accordingly the original fetch position by length indicated by the difference between the counter and the index-value. (As would be described later, the shift would be increased by +1). Note that we could not truncate the chunk because there could be several characters in low order counter which might be matched the index- value of TWINS. Note: Further checks would be taken place ONLY for inequality. if the LOWEST has not been reached yet, it means that the acceptable index-value must be fit between LOWEST and counter (an exception of this is for TWINS which should have been passed the earlier test). if the LOWEST has been reached, it means that the counter position (value) is below the lowest posibble index-value, its impossible in this position that any lower index-values (compared with the counter value) will be found, is not worth to test, it must be higher than the counter value, and also, the index-value must be equal with the LOWEST (since any TWINS should have been catched in the prior comparison test, and so did with the out-range characters, --in fact this state would have never been reached in the last case), and finally it must be the SECOND LOWEST's character found (since the first-one, would have been already passed the previous test). Nevertheless, at this state we could not discard the chunk neither truncate it, since it is possible for all characters (any index-values) betwen the LOWEST and HIGHEST might appear, (although in this state they must be failed the comparison test). Those will bring us a conclusion to shift the original fetch position accordingly so the character in question is virtually placed BEFORE the LOWEST position (recall that the character is the SECOND LOWEST's character found), the shift distance it is equal with the distance between counter and the LOWEST plus one. It fairly possible that some distance exists between this character position with the first LOWEST character position found, but we do not have any information about how much of that, the best that we could do is to shift the original fetch position as stated above. (P = P + LOWEST - COUNTER +1). SUMMARY: There's always a trade between efficiency and completeness. On comparison fails, we could simply fetch the next character instead of doing any further test and processing which may or may not give any advantages for searching. Below are the summary of the steps and tests that could be taken, from a simple to a comprehensive ones. 1. If the comparison fails (the index-value is different from the counter) we could simply step one forward the original fetch position. 2. If the comparison fails, we could do a test to check whether the index-value is OUT-RANGE (higher than the HIGHEST) or not, and if it did, then we could discard the chunk and skip forward that out-range character), else we step the original fetch position one forward. (higher than HIGHEST -> discard) 3. If the comparison fails, the index-value higher than the counter, then we should get the LOWEST and then test whether she has been reached or not by comparing her with the counter, if she was greater than the counter, it means that she has been reached, and vice versa. If she has NOT been reached yet, then we could discard the chunk, else we step the original fetch position one forward. (index-value > counter > LOWEST -> discard) 4. If the comparison fails, we might also do another test, by first getting the index-value2 of the character in the substring at the counter position, if only both of the index-values are equal then the skim continued, else, if the chunk's index-value is out of RANGE-bound (higher than the counter), then we could discard the chunk and skip the original fetch to position by the LENGTH away, else we step the original fetch position one forward. 5. If the comparison fails, the index-value2 we've got from the substring's character at position of the counter is different from the chunk's index-value, then we might do a final test, that is checking whether the LOWEST has beeen reached by comparing the counter with the LOWEST. If the values are different (and it will --as described above, the LOWEST's value MUST be higher than the counter, and also it will be EQUAL with the chunk's index-value), then we should positioned the chunk's character at her ESTIMATED proper place, ie. shift the ORIGIN by the difference between the counter and the (assumed) index-value plus 1 additional offset. Note that the last step is nothing more but an extension of the previous one, there is no additional state, neither additional burden to the program except for computing the skippable distance. Also in practice, we should not wasted any effort to get the LOWEST value, for she only exists in concept. Ch = SubStr[1] P = address(S) tail = P + length(S) - length(SubStr) -1 @Loop: inc(P) if P > tail then goto @NOTFOUND if S[P] <> Ch goto @Loop Counter = length(SubStr) @SkimLoop: dec(Counter) if Counter = 0 then goto @FOUND I = index-value of S[P] if I == Counter then goto @SkimLoop (I could be any value excep Counter) METHOD1: if S[P + Counter] = SubStr[Counter] then goto @SkimLoop else inc(P, Counter +1) (equal with @Shift1 below) ( Here we assumed that the LOWEST has (always) been reached, then the LOWEST MUST be equal with Counter. It is no harm to do a thing like that, since if the LOWEST has been reached we could either DISCARD or SHIFT the chunk. I > Counter -> DISCARD I < Counter -> SHIFT The beauty of this simplification is that the Counter value will always match the distance that could be skipped. On the other hand, with this treatment we might only shifting the chunk, when it could be discarded at all. ) METHOD2: if I > Counter then begin V = Index-value of SubStr[Counter] if I == V then goto @SkimLoop if Counter > LOWEST then goto @Discard (P = P + HIGEHST +1) else begin if I > HIGHEST then @Truncate (P = P + Counter) else @Shift (by delta value: P = P + Counter - I) end end else (I must be < Counter) goto @Shift1 (P = P + Counter +1) Jakarta, 2004.10.11 by: D.Sofyan & Adrian Hafizh