Tag Archives: Unicode

String comparison in OpenInsight – Part 3 – Linguistic Mode

Using Linguistic Mode

In order to utilize this API without impacting current systems we have introduced a new “mode” into OpenInsight that allows you to determine exactly when you wish to enable linguistic support. This mode comprises three elements:

Mode ID – this is the mode itself, which can be one of the following values:
- (0) Normal, non-linguistic mode.
- (1) Linguistic mode.
Mode Flags – A set of bit-wise flags for use with the Linguistic mode.
Mode Locale – A locale identifier for use with the Linguistic mode (defaults to the current user’s locale).

It’s simply a case of setting the mode when you want it to apply to sorting and case-insensitive operations, and turning it off when you don’t. Just like with Extended Precision Mode you can set a default mode for your application and then adjust this at runtime as desired.

(Note that using the Linguistic mode is not affected by OpenInsight’s ANSI or UTF8 mode, as the string comparisons are processed “outside” in Windows itself.)

The following five functions are used to control the Linguistic Mode:

GetDefaultStrCmpMode – returns the default application mode settings.
SetDefaultStrCmpMode – sets the default application mode.
GetStrCmpMode – returns the current mode settings.
SetStrCmpMode – sets the current mode.
GetStrCmpStatus – returns the status of a string comparison operation.

Along with this set of equates:

RTI_STRCMPMODE_EQUATES
MSWIN_COMPARESTRING_EQUATES

Example:

$Insert RTI_StrCmpMode_Equates
$Insert MSWin_CompareString_Equates

// Set the mode to Linguistic, sorting digits as numbers, case-insensitive, 
// and with linguistic casing, using the "en-UK" locale
SCFlags = BitOr( LINGUISTIC_IGNORECASE$, NORM_LINGUISTIC_CASING$ )
SCFlags = BitOr( SCFlags, SORT_DIGITSASNUMBERS$ )

Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, SCFlags, "en-UK" ) 

// Now do some sorting ...
Call V119( "S", "", "A", "L", data, "" )

Full details on each of these functions can be found at the end of this post, but let’s take a look in more detail at the each of the mode settings:

Mode ID

This is an integer value that controls how string comparisons are made:

When set to “0” then the application will run in “normal” mode, which means that string comparisons will use the methods described in parts 1 and 2 of this series. The Mode Flags and Mode Locale settings are ignored.

When set to “1” the application uses the Windows CompareStringEx function for string comparisons instead. The Mode Flags and the Mode Locale settings will also be used with this.

Mode Flags

This setting is a integer comprising one or more optional bit-flags that are passed to the Windows CompareStringEx function when running in Linguistic Mode (It may be set to 0 to apply the default behavior). A full description of their use can be found in the Microsoft documentation for the CompareStringEx function, but briefly these are:

Flag	Description
LINGUISTIC_IGNORECASE$	Ignore case, as linguistically appropriate.
LINGUISTIC_IGNOREDIACRITIC$	Ignore nonspacing characters, as linguistically appropriate.
NORM_IGNORECASE$	Ignore case.
NORM_IGNOREKANATYPE$	Do not differentiate between hiragana and katakana characters.
NORM_IGNORENONSPACE$	Ignore nonspacing characters.
NORM_IGNORESYMBOLS$	Ignore symbols and punctuation.
NORM_IGNOREWIDTH$	Ignore the difference between half-width and full-width characters.
NORM_LINGUISTIC_CASING$	Use the default linguistic rules for casing, instead of file system rules.
SORT_DIGITSASNUMBERS$	Treat digits as numbers during sorting.
SORT_STRINGSORT$	Treat punctuation the same as symbols.

Mode Locale

This is can be the name of the locale to use (like “en-US”, “de-CH” etc.), or one of the following special values:

“0” or null – Use the current user locale (LOCALE_NAME_USER_DEFAULT).
“1” – Use the current OS locale (LOCALE_NAME_SYSTEM_DEFAULT).
“2” – Use an invariant locale that provides stable locale and calendar data (LOCALE_NAME_INVARIANT)

The Linguistic Mode and Basic+

The following Basic+ operators and functions are affected by the Linguistic Mode :

LT operator
LE operator
EQ operator
NE operator
GE operator
GT operator
_LTC operator
_LEC operator
_EQC operator
_NEC operator
_GEC operator
_GTC operator
IndexC function
V119 function
Locate By statement
LocateC statement

Note that when used with the case-insensitive operators and functions (such as _eqc, IndexC() etc.) the LINGUISTIC_IGNORECASE$ flag is always applied if the NORM_IGNORECASE$ has not been specified.

Performance considerations

Using the Linguistic Mode can impact performance for two reasons:

There is just more work to do – comparison of strings using more complex rules will always be slower that a simple comparison of ordinal byte values or code points.
The strings must be copied and transformed into UTF16 (wide) strings before passing to the Windows CompareStringEx function. While this is not a slow operation in and of itself it will add some overhead.

Because of this Linguistic Mode is not enabled by default – you are free choose when to apply it yourself.

String Comparison Mode functions

GetDefaultStrCmpMode function

This function returns an @fm-delimited dynamic array containing the current default string comparison mode settings for the application in the format:

<1> Mode
<2> Flags
<3> Locale

Example:

$Insert RTI_StrCmpMode_Equates

DefSCM  = GetDefaultStrCmpMode()
DefMode = DefSCM<GETSTRCMPMODE_MODE$>

SetDefaultStrCmpMode function

This function sets the default string comparison mode for an application. The mode is set to these default values for each new request made to the engine (i.e each event or web-request). This is to protect against situations where an error condition could force the engine to abort processing before the mode could be reset, thereby leaving it in an unknown state.

This function takes three arguments:

Name	Description
Mode	Specifies the default mode to set: “0” for Normal mode, or “1” for Linguistic Mode.
Flags	Bitmask integer that specifies the default flags to use when in Linguistic Mode
Locale	Specifies the name of the default locale to use.

Example:

$Insert RTI_StrCmpMode_Equates
$Insert MSWin_CompareString_Equates

// Set the default mode to Linguistic, sorting digits as numbers, using the
// user's locale
SCFlags = SORT_DIGITSASNUMBERS$

Call SetDefaultStrCmpMode( STRCMPMODE_LINGUISTIC$, SCFlags, "" )

GetStrCmpMode function

This function returns an @fm-delimited dynamic array containing the current string comparison mode settings for the application in the format

<1> Mode
<2> Flags
<3> Locale

Example:

$Insert RTI_StrCmpMode_Equates

CurrSCMode = GetStrCmpMode()<GETSTRCMPMODE_MODE$>

SetStrCmpMode function

This function sets the current string comparison mode for an application. Note that the mode is set to the default values for each new request made to the engine (i.e each event or web-request). This is to protect against situations where an error condition could force the engine to abort processing before the mode could be reset, thereby leaving it in an unknown state.

This function takes three arguments:

Name	Description
Mode	Specifies the mode to set: “0” for Normal mode, or “1” for Linguistic Mode.
Flags	Bitmask integer that specifies the flags to use when in Linguistic Mode
Locale	Specifies the name of the locale to use.

Example:

$Insert RTI_StrCmpMode_Equates
$Insert MSWin_CompareString_Equates

// Set the mode to Linguistic, sorting digits as numbers, case-insensitive, 
// and with linguistic casing, using the "en-UK" locale
SCFlags = BitOr( LINGUISTIC_IGNORECASE$, NORM_LINGUISTIC_CASING$ )
SCFlags = BitOr( SCFlags, SORT_DIGITSASNUMBERS$ )

Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, SCFlags, "en-UK" ) 

// Now do some sorting ...
Call V119( "S", "", "A", "L", data, "" )

GetStrCmpStatus function

While it is unlikely that the CompareStringEx function will raise any errors it is possible if incompatible flags or parameters are used. In this case Windows returns an error code which may be accessed in Basic+ via this function (See the CompareStringEx documentation for more details on error values).

Example:

$Insert RTI_StrCmpMode_Equates
$Insert MSWin_CompareString_Equates

// Set the mode to Linguistic, sorting digits as numbers, case-insensitive, 
// and with linguistic casing, using the "en-UK" locale
SCFlags = BitOr( LINGUISTIC_IGNORECASE$, NORM_LINGUISTIC_CASING$ )
SCFlags = BitOr( SCFlags, SORT_DIGITSASNUMBERS$ )

Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, SCFlags, "en-UK" ) 

// Now do some sorting ...
Call V119( "S", "", "A", "L", data, "" )

SCError = GetStrCmpStatus()
If SCError Then
   ErrorText = RTI_ErrorText( "WIN", SCError )
End

Conclusion

This concludes this mini-series on OpenInsight string comparison processing. Hopefully you’ll find the new Linguistic Mode useful in your own applications, bearing in mind that some of the custom sorting options, such as “Treat Digits As Numbers”, can have a use in any application beyond simply dealing with non-English language sets.

Some of the more astute readers among you may have noticed that no mention of indexing has been made so far with respect to Linguistic Mode. This is because work is currently ongoing in this part of the system, and we’ll give you more details regarding this at a later date.

String comparison in OpenInsight – Part 2 – UTF8 Mode

1 Reply

Welcome to the second part of our mini-series explaining the mechanics of how string comparisons are handled in OpenInsight. In our previous post we looked at the inner workings when running in ANSI mode – this time we’ll look at UTF8 mode instead.

(Note that we’ve included some Basic+ pseudo-code examples in this post to illustrate more clearly how some parts of the comparison routines work. These are simplifications of the actual C++ internal functions and not actual code from the system itself.)

String comparison in UTF8 Mode

In UTF8 mode characters can be multi-byte and therefore have a value greater than 255 (normally referred to as their “code point”, or in Basic+ terms, the Seq() value of a character), so this means that the standard ANSI-mode method described previously cannot be used. Instead, a slightly different approach is taken to allow higher code points to be included in custom sorting.

When the system is loaded the UTF8 library creates an internal character-map (called the “ANSI-map”) which is a 256-element array (0-255) of code-point values. This is initialized to the same values as the standard ANSI character set, i.e. position 65 will have the code point for the ANSI character with the value of 65, position 230 will have the code point for the ANSI character with the value of 230 and so on.

This ANSI-map this can be changed at runtime so that code points that are higher than 255 can be included, and code points that appear in the ANSI-map are always sorted lower than those that aren’t, regardless of their actual value. The following functions (exported from RevUTF8.dll) are used to query and update the ANSI-map:

GetAnsiToUnicode – returns the code point for a specified map element.

// MapIndex - must be an integer between 0 and 255 
CodePoint = GetAnsiToUnicode( MapIndex )

SetAnsiToUnicode – updates the code point for a specified map element.

// MapIndex - must be an integer between 0 and 255
// NewCodePoint - integer value of the code point to set
Call SetAnsiToUnicode( MapIndex, NewCodePoint )

UTF8 comparison method

When comparing two characters we first need to find a “sort index” for a character which is determined as follows:

Get the code point value for the character being compared.
Look in the ANSI-map using the low byte value of the code point as the index. If the value at that position is the same as the character code point then the sort index is set to that index and it is marked as “found”.
- E.g. If the character has a code point value of 458 (0x1CA) then it’s low-byte value is 202 (0xCA). If the ANSI-map contains the value 458 at index 202 then the sort index is set to 202 and it is marked as “found”.
Otherwise, scan backwards through the ANSI-map looking for an element that has the same value as the code-point for the character. If we match it then the sort index is set to the same position and it is marked as “found”.

// Pseudo-code
dim ansiMap( 255 )

sortIndex = -1 ; // Not found
codePoint = seq( ch )
testIndex = bitAnd( codePoint1, 0xFF )
if ( ansiMap( testIndex ) == codePoint ) then
   // Found
   sortIndex = testIndex
end else
   // Not found
   for testIndex = 255 to 0 step -1
      if ( ansiMap( testIndex ) == codePoint ) then
         // Found and exit loop
         sortIndex = testIndex
      end         
   next
end

Once this has been done for both characters we use the following comparison procedure:

If one of the characters is marked as “not found” and the other as “found”, the latter is always sorted before the former.
Otherwise we now proceed in a manner similar to the ANSI comparison:
- If we are using a collation sequence the sorting value for each character is extracted from the appropriate sequence using the sort index we determined above.
  - E.g. if the sort index was 202 then the sort value for the comparison is the value of the byte at position 203 (1-based) in the sequence.
- If we are using a case-insensitive comparison without a collation sequence the two sort indexes (not values!) are masked with 0xDF and compared.
- If we are using a case-sensitive comparison without a collation sequence the two original code-point values are compared.

// Pseudo-code
begin case
   case ( sortIndex1 == -1 ) and ( sortIndex2 == -1 )
      // Both Non-ANSI-mapped - use a simple code point compare
      cmpVal = codePoint1 - codePoint2
   case ( sortIndex1 == -1 )
      // sortIndex2 was found in the ANSI map so it's sorted lower
      cmpVal = 1
   case ( sortIndex2 == -1 )
      // sortIndex1 was found in the ANSI map so it's sorted lower
      cmpVal = -1
   case OTHERWISE$
      // Both are ANSI mapped
      begin case
         case hasCollationSequence
            sortVal1 = seq( collationSequence[sortIndex1+1,1] )
            sortVal2 = seq( collationSequence[sortIndex2+1,1] )
            
         case isCaseInsensitive         
            sortVal1 = bitAnd( sortIndex1, 0xDF )
            sortVal2 = bitAnd( sortIndex2, 0xDF )
            
         case OTHERWISE$
            sortVal1 = codePoint1
            sortVal2 = codePoint2
            
      end case
      
      cmpVal = sortVal1 - sortVal2
      
end case

So, this system works pretty well out of the box for languages that can be expressed using the ANSI character set, but for other languages much of the burden falls on the application developer to maintain and tune the language settings and collation sequences to their requirements. This could require considerable effort and ignores much of the functionality provided by the OS itself, so in the next post we’ll take a look at how this is being addressed in the next release.

String comparison in OpenInsight – Part 1 – ANSI Mode

2 Replies

Developers with systems that require Unicode processing will be please to know that the next release adds some new and much-needed string comparison functionality, to which we have given the super-catchy title of the “Linguistic String Comparison Mode”. However, before we get into the details of that, it’s worth taking a look at how string comparison is currently handled in the system as it has a huge effect on how your data is sorted, and it seems to be one of those murky and little understood areas which includes arcane terms like “collation sequences” and “ANSI Maps”.

String comparison in ANSI mode

String comparison in ANSI mode (i.e. where every character has a value between 0 and 255) is usually a straightforward exercise of comparing the byte value of characters against each other. However, it is possible to customize this using the classic ARev technique of “collation sequences”, which allow a developer to assign a custom weighting, or “sort value”, to a specific character when it is compared against others.

Collation sequences are contained in records in the SYSENV table with a prefix of “CM_”. By default OpenInsight includes the following “CM_” records:

CM_ANSI
CM_ISO
CM_US

A collation sequence may be attached to a language definition (“LND”) record by specifying the “CM_” key in field <10>, and when you load the LND record the collation sequence is loaded too. For example, in the German LND record (LND_GERMAN_D) you will see CM_ISO has been specified like so:

Each CM record may contain one or two fields, each field containing a block of 256 bytes that form a collation sequence:

<1> Case-sensitive sequence (required)
<2> Case-insensitive sequence (optional)

To give a character a weighting simply find it it’s 1-based index in the sequence, and enter the byte value you want to give it instead. For example, if we wanted to give the “3” character a sort value of “87” then we do the following:

Find the ANSI byte value of the character “3”, which is 51 (0-based value).
In field<1> we replace the character at index 52 (1-based value) with a character that has the ANSI byte value of 87, which is “W”.

Points to note:

The first field (sequence) in a CM record is required, while the case-insensitive sequence is not – the system will use a default method for case-insensitive comparison as discussed below.
A collation sequence must be 256 characters in length. If not the system will not use it.
The last two characters in the sequence (255 and 256, 1-based) are always set to the byte values of 254 and 255 (@fm and @rm).
Some LND records have data in field<11> – this is for historical reasons and may be ignored.

So, now that we know what collation sequences are, we can see how the system uses them at runtime when it needs to compare strings values.

ANSI comparison when using a collation sequence

If a collation sequence has been specified (for either a case-sensitive operation, or a case-insensitive operation) the system uses it to extract the sorting value of the character being processed:

Get the ANSI byte value of the string character being compared.
Use the ANSI byte value as an index into the appropriate collation sequence.
The sorting value for the string character is the byte value at that index.
- E.g. The character “3” has an ANSI byte value of 51 – using this as an index we get byte 52 (1-based) from the collation sequence – the value of that byte is the sorting value.

// Pseudo-code
ansiVal1 = seq( ch1 )
sortVal1 = seq( collationSequence[ansiVal1+1,1] )

ansiVal2 = seq( ch2 )
sortVal2 = seq( collationSequence[ansiVal2+1,1] )

cmpVal   = ( sortVal1 - sortVal2 )

ANSI case-insensitive comparison without a collation sequence

Case-insensitive comparison of two characters without a collation sequence uses the classic ASCII-style technique of bit-masking both character ANSI byte values with a value of 0xDF and then comparing the results.

// Pseudo-code
sortVal1 = bitAnd( seq( ch1 ), 0xDF )
sortVal2 = bitAnd( seq( ch2 ), 0xDF )

cmpVal   = ( sortVal1 - sortVal2 )

ANSI case-sensitive comparison without a collation sequence

Case-sensitive comparison of two characters without a collation sequence simply compares the ANSI byte value of the characters against each other.

// Pseudo-code
cmpVal = ( seq( ch1 ) - seq( ch2 ) )

That concludes this look at ANSI string comparison – in the next post we’ll take a look at how string comparison is handled in UTF8 mode, which is, as you might expect, a little more complex.

Building OpenInsight 11

News and notes from the bleeding edge of Revelation Software development

Tag Archives: Unicode

String comparison in OpenInsight – Part 3 – Linguistic Mode

Using Linguistic Mode

Mode ID

Mode Flags

Mode Locale

The Linguistic Mode and Basic+

Performance considerations

String Comparison Mode functions

Conclusion

Further reading

String comparison in OpenInsight – Part 2 – UTF8 Mode

String comparison in UTF8 Mode

UTF8 comparison method

String comparison in OpenInsight – Part 1 – ANSI Mode

String comparison in ANSI mode

ANSI comparison when using a collation sequence

ANSI case-insensitive comparison without a collation sequence

ANSI case-sensitive comparison without a collation sequence