
Re: Non-ASCII characters in Green

Subject: Re: Non-ASCII characters in Green
From: Roger Gregory <roger@xxxxxxxxxxxxxxxxxxxxx>
Date: Sun, 11 May 2003 19:47:36 -0700
Cc: udanax@xxxxxxxxxx
References: <20030507101048.GA5199@nomad> <3EBB3F9F.5080601@xxxxxxxxxx>

Jeff Rush wrote:

arbitrary byte sequence as the server doesn't trim high order bits. I presume that UTF-8 strings don't have an embedded zero byte that would mess up C code.

The Green code doesn't use C strings anywhere. Green was designed to contain anything, including binaries and graphics, so it couldn't be allowed to fail on NUL bytes. The problem with UTF is that it is Unicode Transformation Format, an add-on that has the same kind of mixed byte-code sizes as Shift JIS and other immoralities. In Gold all this could be taken care of by having the 8-bit characters in 8-bit space and the 16- and 32-bit characters in 16- and 32-bit spaces. However, Green doesn't have such spaces, so the trick is to simulate them with link indications and conventions. If one just threw the UTF encodings into Green, you should expect that Green would expect there to be nbytes of content. Now if one interpreted the byte string as a UTF character string, one would have to map the discrepancies between nbytes and nchars to track the display positions. Note that for editing this map needs to be editable, and it has to display whatever link annotations there are correctly, but it shouldn't be too difficult, considering all the other mapping that needs to be done. This concept is closely related to the snert concept in the Gold frontend.
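
A rough illustration of the nbytes/nchars map described above, in modern Python 3 (not code from Green or its front end). The helper names char_start_offsets and byte_to_char are made up for this sketch, and it only covers a read-only map over well-formed UTF-8; an editable map with link annotations would need more machinery.

```python
import bisect

def char_start_offsets(data):
    """Return the byte offset where each UTF-8 character starts,
    followed by one final entry equal to len(data)."""
    offsets = []
    for i, b in enumerate(data):
        # Continuation bytes have the form 10xxxxxx; any other byte
        # begins a new character.
        if b & 0xC0 != 0x80:
            offsets.append(i)
    offsets.append(len(data))
    return offsets

def byte_to_char(offsets, byte_pos):
    """Map a byte position to the index of the character containing it."""
    return bisect.bisect_right(offsets, byte_pos) - 1

data = "Straße".encode("utf-8")
offsets = char_start_offsets(data)
nbytes = len(data)          # what a byte-oriented store would count
nchars = len(offsets) - 1   # what the display has to count
print(nbytes, nchars, offsets)
```

Here the sharp s occupies two bytes, so the byte count and character count differ by one, and byte positions 4 and 5 both map to character 4.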

Disclaimer: I wrote much of the Green code, but I don't remember most of the filenames and line numbers anymore.

I'd look at the portion of the Python code that transmits the (UTF-8) string over the TCP socket and ensure that that translation is occurring correctly. I'm ignorant of what happens when a Unicode string is passed to a socket write call. You mention you changed String_write(), but did you change String_read() to examine the returned string and treat it as Unicode as appropriate?

Re how to support it in the bigger scheme, the original Xanadoers believed that it ought to be transparent to the backend, and that to indicate whether the byte sequence in a particular document is 8 or 16 bits or encoded in some manner, a link would be added by the front-end indicating that. Of course all front-ends must then query for and respect that link, but no such standardization has yet been done.

I wonder whether 16-bit chars ought to be done with a different resource type (1 = bytes, 2 = links, 3 = words) so that it isn't even possible to address the bytes out-of-phase as you could using a link-type. I wouldn't use a different resource type for each encoding though, just for each physical chunk size.

-Jeff
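
To make the String_write()/String_read() point concrete, here is a minimal sketch in modern Python 3 of encoding on the way out and decoding on the way back. The functions write_string and read_string, and the "length~payload" framing, are invented for this example and are not Green's actual wire format; an in-memory io.BytesIO stands in for the TCP socket. The point it illustrates is only that both sides have to agree on bytes, and that any length sent must count bytes rather than characters.

```python
import io

def write_string(stream, text):
    """Hypothetical writer: encode to UTF-8 and send '<byte count>~<payload>'."""
    payload = text.encode("utf-8")
    # The count is the number of bytes, which can exceed len(text).
    stream.write(b"%d~" % len(payload))
    stream.write(payload)

def read_string(stream):
    """Hypothetical reader: parse the byte count, read that many bytes, decode."""
    digits = b""
    while True:
        ch = stream.read(1)
        if ch == b"":
            raise EOFError("stream ended before the length delimiter")
        if ch == b"~":
            break
        digits += ch
    payload = stream.read(int(digits))
    return payload.decode("utf-8")

wire = io.BytesIO()            # stand-in for the socket
write_string(wire, "Straße")
wire.seek(0)
received = read_string(wire)
print(wire.getvalue(), received)
```

If the writer had sent len(text) (6) instead of len(payload) (7), the reader would have stopped one byte short, so the read side needs the same care as the write side.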
After doing this I was able to insert some German text, but when I
reloaded it, regular ASCII characters were substituted.  For example,
a 223 LATIN SMALL LETTER SHARP S became 67 LATIN CAPITAL LETTER C.
Any ideas?

Regards,

I would follow Jeff's advice and check whether anything in the stream strips the 8th bit. Also insert some 16-bit characters and see if they get displayed as 2 characters (they should be, but some characters may not display), then track where those data get lost. Keep us posted; this is good work, and I intend to find the known bugs in Green someday when I get to a stable place on my rocket project (see www.halfwaytoanywhere.com).
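
A quick check of the 8th-bit theory against the reported symptom, written in modern Python 3 as an illustration only. The helper strip_high_bit is made up here and simulates a hypothetical stage that masks each byte to 7 bits; it says nothing about where in the real stream such a stage might be.

```python
def strip_high_bit(data):
    """Simulate a stream stage that clears the 8th bit of every byte."""
    return bytes(b & 0x7F for b in data)

sharp_s = "\u00df"                      # code point 223, LATIN SMALL LETTER SHARP S
encoded = sharp_s.encode("utf-8")       # two bytes: 0xC3 0x9F
stripped = strip_high_bit(encoded)

print(ord(sharp_s))     # 223
print(list(encoded))    # [195, 159]
print(list(stripped))   # [67, 31]

# For comparison: the single-byte Latin-1 form of the same character.
latin1_stripped = strip_high_bit(sharp_s.encode("latin-1"))
print(list(latin1_stripped))   # [95]
```

Masking the first UTF-8 byte of the sharp s (195) gives 67, capital C, which matches the substitution reported; the second byte becomes 31, a control character that may simply not display. Masking the Latin-1 byte would give 95 (an underscore) instead, so the symptom is at least consistent with UTF-8 bytes losing their high bit somewhere along the way.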





--
 Roger Gregory
 75 Melba
 San Francisco ca 94132
 415 664-6850 home
 roger@xxxxxxxxxx
 roger@xxxxxxxxxxxxxxxxxxxxx

http://www.halfwaytoanywhere.com

http://www.udanax.com
the software but not the name!
the name but not the software!
http://www.xanadu.com

References:
- Non-ASCII characters in Green
  - From: Aaron Bingham
- Re: Non-ASCII characters in Green
  - From: Jeff Rush
