Advantage Developer Zone

Advantage and Unicode

Tuesday, March 02, 2010

Advantage 10 includes three new field types: nChar, nVarChar and nMemo. These field types store Unicode characters, which greatly expands the number of languages Advantage supports. Indexes are supported on all three Unicode field types, with over 200 Unicode collations available. In this tech-tip we discuss what Unicode is and how you can use Unicode characters with Advantage.

What is Unicode

The Unicode standard was introduced in October 1991 and, as of October 2009, is at version 5.2. Unicode assigns each character a number called a code point; an encoding such as UTF-16 then represents each code point as one or more fixed-size code units. This allows the characters of many languages to coexist in a single encoding. The Unicode code space is divided into seventeen planes, each comprising 65,536 code points (256 rows of 256 code points).

Of these seventeen planes, only six are currently in use. In most cases developers will be using the Basic Multilingual Plane (BMP), which contains the most characters and focuses primarily on writing systems currently in use. The bulk of the BMP is allocated to Chinese, Japanese and Korean characters. Additional multilingual characters are assigned to the Supplementary Multilingual Plane (SMP).

The Supplementary Ideographic Plane (SIP) holds additional Chinese, Japanese and Korean ideographs that are not part of the BMP; historic scripts such as Egyptian hieroglyphs are assigned to the SMP. The Supplementary Special-purpose Plane (SSP) contains non-printing format characters such as language tags and variation selectors. Two additional planes are reserved for private use. This leaves eleven planes unassigned, which is room for more than 700,000 additional code points in the future.
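The plane arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the numbering scheme; the helper name plane_of is made up for this example and is not part of Advantage or any library.

```python
# Constants taken from the text: 17 planes of 65,536 code points each.
PLANE_SIZE = 0x10000                            # 65,536 code points per plane
PLANE_COUNT = 17
MAX_CODE_POINT = PLANE_SIZE * PLANE_COUNT - 1   # 0x10FFFF


def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that holds the given code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point // PLANE_SIZE


# Plane 0 is the BMP, 1 the SMP, 2 the SIP, 14 the SSP, 15-16 private use.
for cp in (0x0041, 0x4E2D, 0x13000, 0x20000, 0xE0001, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```

Dividing by 65,536 is the same as dropping the last four hexadecimal digits, which is why a four-digit code point is always in the BMP.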

What do I need to know about Unicode

Unicode code points are simply numbers in the range 0 to 10FFFF (hexadecimal). A code point is written as U+ followed by its value in hexadecimal, padded to at least four digits; code points beyond the BMP take five or six digits, for example U+20000. Older texts sometimes use U- followed by eight digits (U-00000041) for the same value.

For instance, ADS in the Latin (Western) character set is written as U+0041 U+0044 U+0053. These values map to the standard ASCII codes you may be more familiar with (A = 65, D = 68, S = 83).
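To check the ADS example yourself, the following Python sketch prints the code point of each character in both hexadecimal and decimal. The function name code_point_notation is a made-up helper for this illustration.

```python
def code_point_notation(text: str) -> list:
    """Format each character as U+ followed by at least four hex digits."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_point_notation("ADS"))   # ['U+0041', 'U+0044', 'U+0053']
print([ord(ch) for ch in "ADS"])    # [65, 68, 83]
```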

Traditionally, DOS and early Windows operating systems used a single byte (8 bits) to store each character. Since a single byte allows only 256 possible values, far fewer than the number of known characters, a code page scheme is used to support different languages. A code page tells the OS how to interpret a given character value for a particular language. For example, character value 196 (0xC4) is Ä in code page 1252 (Western Latin character set), Д in code page 1251 (Cyrillic character set), and ─ in code page 437 (IBM PC and MS-DOS OEM-USA character set). Characters encoded using the code page scheme are generally referred to as ANSI/OEM characters.

Although the code page scheme as a whole allows more than 256 characters to be encoded, only one code page can be active at a time, and characters encoded in one code page may not be available in another. This makes information exchange difficult. If you take a file that is encoded in code page 1252 on a US Windows PC and open it on a Russian Windows machine, the file may not display correctly if it contains characters with values above 127. The same problem exists when a database table encoded using one code page is opened on a computer using a different code page. The problem is even more severe for database applications, because an index built using one code page is logically corrupted when used on a computer with a different code page.
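The three readings of byte value 196 described above can be reproduced with Python's built-in codecs. This is a small sketch to show the ambiguity, not anything Advantage-specific; the output prints only code point numbers so it works on any console.

```python
RAW = bytes([0xC4])  # a single byte with value 196

# Decode the same byte under each code page named in the text.
decoded = {page: RAW.decode(page) for page in ("cp1252", "cp1251", "cp437")}

for page, ch in decoded.items():
    print(f"{page}: byte 0xC4 -> U+{ord(ch):04X}")
```

The byte never changes; only the table used to interpret it does, which is why the same file can look different on two machines.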

Unicode solves this problem by having a unique code point for each distinct character. Using the example above, the Unicode code points for the three characters are U+00C4 (Ä), U+0414 (Д), and U+2500 (─). Characters encoded in Unicode will be interpreted and displayed the same regardless of the code pages supported on the operating system. Modern versions of Windows use Unicode internally and allow files to be saved using either Unicode or a code page encoding.
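As a quick check that each of these three code points identifies exactly one character, the sketch below looks up their official names with Python's unicodedata module and round-trips them through UTF-16. The little-endian byte order is an assumption made for the example.

```python
import unicodedata

CODE_POINTS = (0x00C4, 0x0414, 0x2500)

# Each code point has exactly one name, independent of any code page.
names = {cp: unicodedata.name(chr(cp)) for cp in CODE_POINTS}
for cp, name in names.items():
    print(f"U+{cp:04X} {name}")

# All three characters can live in the same string and survive a round trip.
text = "".join(chr(cp) for cp in CODE_POINTS)
round_tripped = text.encode("utf-16-le").decode("utf-16-le")
print(round_tripped == text)
```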

How does it work with Advantage

Advantage 10 includes three new data types which store Unicode characters. nChar is a fixed-length Unicode string, nVarChar is a variable-length Unicode string (trailing spaces are not returned) and nMemo is a Unicode string of unlimited length. Each of these field types is supported in SQL WHERE, ORDER BY and GROUP BY clauses. You can also specify Unicode parameter values and use SQL scalar functions with them.

Advantage stores Unicode data using UTF-16 encoding, which offers a good balance of space and performance across languages. However, because UTF-16 uses at least two bytes (16 bits) for every character, it is less space-efficient than a single-byte code page for Latin-based text. Characters outside the BMP take four bytes (a surrogate pair).
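To get a feel for the space trade-off, the following Python sketch counts the bytes needed to encode a few sample strings as UTF-16. It measures the encoding only and says nothing about how Advantage lays out records on disk; the little-endian byte order and the helper name utf16_size are assumptions for this example.

```python
def utf16_size(text: str) -> int:
    """Bytes needed to encode text as UTF-16 without a byte order mark."""
    return len(text.encode("utf-16-le"))


samples = {
    "ADS": "Latin letters",
    "\u65e5\u672c\u8a9e": "three kanji from the BMP",
    "\U00020000": "one ideograph from plane 2",
}

for text, label in samples.items():
    print(f"{label}: {len(text)} code point(s), "
          f"{utf16_size(text)} bytes in UTF-16, "
          f"{len(text.encode('utf-8'))} bytes in UTF-8")
```

Latin text doubles in size compared with a single-byte code page, while the BMP kanji take two bytes each and the plane 2 ideograph takes four.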

Two additional files are required to support Unicode: aicu.dll and icudt40l.dat. The aicu.dll file contains the Unicode functions used by Advantage, and the Unicode collations are stored in the .dat file. These files take up approximately 15MB of disk space and must be distributed with both the client and the server when using Unicode field types.

Collations and Indexes

Proper indexing requires a collation, or sort order. When installing Advantage Database Server you are asked to choose the default collations. These are the collations that will be used whenever an index is created. If you work with multiple languages on a particular server, you can dynamically assign collations at the connection or table level.

You can now specify both an ANSI/OEM and a Unicode collation. ARC provides a dropdown list of available collations for both ANSI/OEM and Unicode. Unicode collations include an option for a case-insensitive collation sequence.
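To illustrate why the choice of collation matters for an index, the Python sketch below sorts the same words two ways. It is only a rough stand-in: Python's str.casefold is used here to imitate a case-insensitive ordering and is not the collation logic Advantage actually applies.

```python
words = ["banana", "Apple", "cherry", "apple", "Banana"]

# Ordinal order compares raw code points, so every uppercase letter
# (U+0041 to U+005A) sorts before every lowercase letter (U+0061 to U+007A).
ordinal = sorted(words)

# Rough stand-in for a case-insensitive collation: compare case-folded keys.
# sorted() is stable, so words with equal keys keep their input order.
case_insensitive = sorted(words, key=str.casefold)

print(ordinal)           # ['Apple', 'Banana', 'apple', 'banana', 'cherry']
print(case_insensitive)  # ['Apple', 'apple', 'banana', 'Banana', 'cherry']
```

An index built under one ordering cannot be searched correctly under the other, which is the same reason an index must be used with the collation it was created with.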

ANSI/OEM collations are stored in the ADSCollate table and Unicode collations are stored in icudt40l.dat as previously mentioned. You can get a list of available collations using the sp_GetCollations() system procedure. This procedure takes one string parameter which can be used to restrict the list of collations based on the name pattern specified. Passing in an empty string returns all available collations.

Putting it all Together

Once a table is created with Unicode field types, you can enter data directly into the fields. Both Visual Studio and Delphi (2009 and newer) contain full support for Unicode characters. This allows you to read and write Unicode strings with programs written in these development environments.

All of the Advantage clients, with the exception of the Clipper Libraries, support reading and writing of Unicode data. This allows you to use the development platform of your choice to access data stored with any of the supported character sets. The table below contains an nChar field which is used to store Japanese kanji.

Unicode characters can also be inserted into tables using SQL. The Advantage SQL engine includes full support for Unicode literals, allowing you to insert, update, find and filter Unicode data.

Limitations

The full text search engine does not currently support Unicode characters. Full text support is planned for a future release of Advantage.
You cannot use Unicode strings as field names or database object names. You must still use Latin-based characters when naming table fields and other database objects (e.g. triggers, stored procedures, UDFs).

Conclusion

Unicode allows you to store data in virtually every written language. This can greatly expand your market, since you will be able to offer your application in even more countries. These Unicode field types are similar to the Unicode types supported by other database systems, which makes migration to Advantage easier.

You can see a demonstration of using Unicode with Advantage in this short screencast.