Re: [Snowball-discuss] Unicode

From: Michael Schlenker (schlenk@uni-oldenburg.de)
Date: Fri Feb 22 2002 - 18:03:54 GMT


At 10:35 22.02.2002 -0700, you wrote:

>Unicode is on its way, and I'll outline the proposed solution:
>
>For input I think I'll alter the syntax of hex strings so that internal
>spaces separate characters. This is not upwards-compatible, but I can't
>imagine anyone will be upset.
>
>So hex '0D0A'
>needs to be written hex '0D 0A'
>or even hex 'D A' // leading zeroes can be omitted
>
>A new style 'decimal' will be introduced:
>
> decimal '13 10' // cr lf
>
>Then we allow all values from 0 to 64K-1. Values >= 64K produce an error
>message.
>
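
To make the proposed syntax concrete, here is a rough sketch in C of how such a string could be read. The function name parse_char_codes and its signature are my own invention for illustration, not anything taken from the Snowball compiler; it only assumes what is described above: values separated by spaces, leading zeroes optional, and anything >= 64K rejected.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of Snowball: parse space-separated numbers
   in the given base (16 for hex '...', 10 for decimal '...') into out[].
   Returns the number of values stored, or -1 if a value is malformed,
   is >= 64K, or does not fit in out[]. */
static int parse_char_codes(const char *s, int base,
                            unsigned short *out, int max_out)
{
    int n = 0;
    for (;;) {
        char *end;
        unsigned long v;

        while (*s == ' ') s++;                  /* skip separating spaces */
        if (*s == '\0') break;                  /* end of the string */
        if (*s == '+' || *s == '-') return -1;  /* strtoul would accept a sign */

        /* Note: for base 16 strtoul also accepts a "0x" prefix. */
        v = strtoul(s, &end, base);
        if (end == s) return -1;                /* no digits at this position */
        if (v > 0xFFFFUL) return -1;            /* values >= 64K are an error */
        if (n == max_out) return -1;            /* output buffer is full */

        out[n++] = (unsigned short)v;
        s = end;
    }
    return n;
}
```

With this sketch '0D 0A' and 'D A' both give the two values 13 and 10, while the old spelling '0D0A' is read as the single value 0x0D0A, which is exactly the incompatibility mentioned above.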
>For output, the java case is not a problem, since strings are made up of 16
>bit items anyway.
>
>For ANSI C we'll have 3 output styles:
>
>1) 8 bit characters, where a reference to a character > 255 is an error. This
>is the default style for output in the ANSI C case, and is what we have at
>the moment.
>
>2) 16 bit characters. The way to get this is to declare all strings in the
>form
>
> static symbol string_37[] = {'f','r','e','d'};
> static symbol string_38[] = {'h','a','r','r','y'};
> ...
>
>and we typedef symbol to 'unsigned short'. When it is typedeffed to
>'unsigned char' we get case (1) again. Of course any of the characters 'f'
>etc may be replaced by a number > 255 to get Unicode characters.
>
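
A small sketch of what style (2) might look like once a non-Latin string is involved. Only the typedef and string_37 come from the text above; string_39 (Cyrillic code points written as numbers) and the eq_symbols helper are made up for illustration.

```c
/* Style (2): 16 bit characters. Changing this typedef to 'unsigned char'
   would give style (1), where the values in string_39 no longer fit. */
typedef unsigned short symbol;

static const symbol string_37[] = {'f','r','e','d'};

/* Hypothetical example string: U+0434 U+043E U+043C, given as numbers
   because they are > 255 and cannot be written as plain char literals. */
static const symbol string_39[] = {0x0434, 0x043E, 0x043C};

/* Illustrative helper: return 1 if the first n symbols of a and b match. */
static int eq_symbols(const symbol *a, const symbol *b, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (a[i] != b[i]) return 0;
    }
    return 1;
}
```

Since the strings are not NUL-terminated, the length has to come from somewhere else, for instance sizeof(string_39) / sizeof(string_39[0]).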
>3) UTF-8 encoded 8 bit characters. I believe the only change to the
>generated C is that cursor movements of the form z->c++; and z->c--; need to
>be replaced by function calls that move over 1,2 or 3 bytes to get to the
>next character.
>
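
For style (3), the replacement for z->c++ and z->c-- could look roughly like the sketch below. The function names and the plain (buffer, cursor, limit) parameters are assumptions of mine, not the real generated code; the only fact relied on is that every UTF-8 continuation byte has the bit pattern 10xxxxxx.

```c
typedef unsigned char symbol;

/* Hypothetical replacement for z->c++: return the cursor position just
   after the UTF-8 character that starts at c. l is the forward limit
   (one past the last byte). */
static int skip_utf8_forward(const symbol *p, int c, int l)
{
    if (c >= l) return l;
    c++;                                         /* step over the lead byte */
    while (c < l && (p[c] & 0xC0) == 0x80) c++;  /* and its continuation bytes */
    return c;
}

/* Hypothetical replacement for z->c--: return the position of the lead
   byte of the UTF-8 character that ends just before c. lb is the
   backward limit (the first valid index). */
static int skip_utf8_backward(const symbol *p, int c, int lb)
{
    if (c <= lb) return lb;
    c--;                                          /* last byte of the character */
    while (c > lb && (p[c] & 0xC0) == 0x80) c--;  /* walk back to the lead byte */
    return c;
}
```

On the six bytes 61 C3 A9 E2 82 AC (a one byte, a two byte and a three byte character), moving forward from 0 visits positions 1, 3 and 6, and moving backward from 6 visits 3, 1 and 0. The sketch assumes well-formed input and does no validation.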
>- - - - - -
>
>Does anyone know of a program of the form
>
> convert <input >output -option
>
>where option could be ISOLatin1_to_Unicode, Unicode_to_Windows etc etc? I'll
>have to put something together like this for test purposes.

It's trivial with Tcl/Tk 8.1 and up (8.3.4 is the current stable version);
these versions are fully Unicode aware.

Just use:
--------------------------------------------------
#!/usr/local/bin/tclsh83
# Should do some simple option processing here; if anyone's interested it's
# trivial. For now just assume the first two arguments are the encodings.

package require Tcl 8.1                ;# needs Tcl 8.1 or later
set inputenc  [lindex $argv 0]         ;# first argument: input encoding
set outputenc [lindex $argv 1]         ;# second argument: output encoding
fconfigure stdin  -encoding $inputenc  ;# decode stdin using the input encoding
fconfigure stdout -encoding $outputenc ;# encode stdout using the output encoding
fcopy stdin stdout                     ;# copy stdin to stdout
-----------------------------------------------
Usage:
convert <infile >outfile iso8859-1 utf-8

To get the supported encodings:
$ tclsh83
% encoding names
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 cp949 cp950 cp869
dingbats ksc5601 macCentEuro cp874 macUkraine jis0201 gb2312 euc-cn euc-jp
macThai iso8859-10 jis0208 iso2022-jp macIceland iso2022 iso8859-13 jis0212
iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania macTurkish
gb1988 iso2022-kr macGreek ascii cp437 macRoman iso8859-1 iso8859-2
iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic
iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253
iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan
cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857

Should be enough for most cases.

Michael Schlenker

(P.S. If you have any questions, just ask me or post in comp.lang.tcl. You can
get Tcl/Tk from SourceForge http://www.sf.net/projects/tcl or from
ActiveState http://tcl.activestate.com )
