Re: Non-ASCII characters in Green
- Subject: Re: Non-ASCII characters in Green
- From: Roger Gregory <roger@xxxxxxxxxxxxxxxxxxxxx>
- Date: Sun, 11 May 2003 19:47:36 -0700
- Cc: udanax@xxxxxxxxxx
- References: <20030507101048.GA5199@nomad> <3EBB3F9F.5080601@xxxxxxxxxx>
Jeff Rush wrote:
arbitrary byte sequence as the server doesn't trim high order bits. I
presume that utf-8 strings don't have an embedded zero byte that would
mess up C code.
The Green code doesn't use C strings anywhere. Green was designed to
contain anything, including binaries and graphics, so it couldn't be
allowed to fail on nul bytes. The problem with UTF is that it is the
Unicode Transformation Format, an encoding with the same kind of mixed
character sizes as Shift-JIS and other immoralities. In Gold all this could
be taken care of by keeping the 8-bit characters in 8-bit space and the
16- and 32-bit characters in 16- and 32-bit spaces. Green has no
such spaces, however, so the trick is to simulate them with link indications
and conventions. If one just threw the UTF encodings into Green,
Green would simply see nbytes of content.
Now, if one interpreted that byte string as a UTF character string, one
would have to map the discrepancies between nbytes and nchars to track
the display positions. Note that for editing this map needs to be
editable, and it has to display whatever link annotations there are
correctly, but that shouldn't be too difficult, considering all the other
mapping that needs to be done. This concept is closely related to the
snert concept in the Gold frontend.
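The nbytes-to-nchars map described above can be sketched in a few lines (modern Python for illustration; `byte_to_char_map` is a hypothetical helper, not part of the Green or frontend code):

```python
def byte_to_char_map(data: bytes) -> list[int]:
    """For each byte offset in UTF-8 data, give its character index.

    UTF-8 continuation bytes (pattern 0b10xxxxxx) belong to the
    preceding character, so they share its index; every other byte
    starts a new character.
    """
    mapping = []
    char_index = -1
    for b in data:
        if b & 0xC0 != 0x80:      # not a continuation byte: new character
            char_index += 1
        mapping.append(char_index)
    return mapping

text = "naïve"                     # 5 characters, 6 bytes in UTF-8
print(byte_to_char_map(text.encode("utf-8")))   # [0, 1, 2, 2, 3, 4]
```

An editable version of this map would need to be updated on every insert and delete, but the per-byte rule itself stays this simple.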
Disclaimer: I wrote much of the Green code, but I don't remember most of
the filenames and line numbers anymore.
I'd look at the portion of the Python code that transmits the (UTF-8)
string over the TCP socket and ensure that that translation is occurring
correctly. I'm ignorant of what happens when a Unicode string is passed
to a socket write call. You mention you changed String_write(), but did
you change String_read() to examine the returned string and treat it as
Unicode as appropriate?
I would follow Jeff's advice and check whether anything in the stream strips
the 8th bit. Also insert some 16-bit characters and see whether they get
displayed as 2 characters (they should be, though some characters may not
display), then track down where those data get lost. Keep us posted; this is
good work, and I intend to find the known bugs in Green someday, when I
get to a stable place on my rocket project (see www.halfwaytoanywhere.com).
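As a first diagnostic along these lines, one could push every possible byte value through the transport and see which ones come back changed. This is only a sketch; `roundtrip` stands in for whatever write/read path the backend actually uses:

```python
def find_mangled_bytes(roundtrip):
    """Send all 256 byte values through `roundtrip` and report
    every position where the returned byte differs."""
    probe = bytes(range(256))
    result = roundtrip(probe)
    return [(i, probe[i], result[i])
            for i in range(256) if probe[i] != result[i]]

# Example: a broken transport that masks off the high bit,
# as Jeff suspects something in the stream may be doing.
broken = lambda data: bytes(b & 0x7F for b in data)
print(find_mangled_bytes(broken)[:2])   # [(128, 128, 0), (129, 129, 1)]
```

A clean transport returns an empty list; a high-bit stripper mangles exactly the 128 values from 0x80 upward.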
Re how to support it in the bigger scheme: the original Xanadoers
believed that it ought to be transparent to the backend, and that the
front-end would add a link indicating whether the byte sequence in a
particular document is 8 or 16 bits, or encoded in some other manner.
Of course all front-ends must then query for and respect that link, but
no such standardization has yet been done.
I wonder whether 16-bit chars ought to be handled with a different resource
type (1 = bytes, 2 = links, 3 = words) so that it isn't even possible to
address the bytes out of phase, as you could with a link-based convention.
I wouldn't use a different resource type for each encoding, though, just
for each physical chunk size.
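The out-of-phase hazard is easy to demonstrate with UTF-16, where a byte-granular address can land mid-character (a hypothetical illustration, not Green's actual addressing):

```python
# Three CJK characters, each 2 bytes in UTF-16-BE.
data = "日本語".encode("utf-16-be")

# Address aligned to a character boundary: correct text.
aligned = data[2:6].decode("utf-16-be")
print(aligned)                  # 本語

# The same span shifted by one byte decodes without error
# but yields entirely different characters.
shifted = data[1:5].decode("utf-16-be")
print(shifted != aligned)       # True
```

A word-granular resource type would make the shifted address inexpressible in the first place, which is the appeal of separating by physical chunk size.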
After doing this I was able to insert some German text, but when I
reloaded it, regular ASCII characters were substituted. For example,
a 223 LATIN SMALL LETTER SHARP S became 67 LATIN CAPITAL LETTER C.
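For what it's worth, the specific substitution reported above is consistent with the stripped-8th-bit theory: ß (code point 223) encodes in UTF-8 as the two bytes 0xC3 0x9F, and 0xC3 with its high bit cleared is 0x43, which is ASCII 'C'. A quick check:

```python
data = "ß".encode("utf-8")                 # b'\xc3\x9f': two bytes
stripped = bytes(b & 0x7F for b in data)   # simulate losing the 8th bit
print(list(data), list(stripped))          # [195, 159] [67, 31]
print(chr(stripped[0]))                    # C, the reported substitute
```

The second stripped byte, 0x1F, is a non-printing control character, so the visible result would be just the 'C' the report describes.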
San Francisco, CA 94132
415 664-6850 home
the software but not the name!
the name but not the software!